101010.pl is one of the many independent Mastodon servers you can use to participate in the fediverse.
101010.pl is the oldest Polish Mastodon server. We support posts of up to 2048 characters.

Server stats: 582 active users

#LLMs

27 posts · 21 participants · 1 post today

Tracing Chemical Knowledge Over Centuries with #LLMs 🧪

Diego Alves, Sergei Bagdasarov & Badr M. Abdullah prompted models to generate structured metadata for 47k+ texts from the #RoyalSociety Corpus (1665–1996), enabling large-scale comparison of #Chemistry and #Biology over time.

They tracked how chemical substances migrated between disciplines, revealing a "chemicalization" of biology in the 19th century and a long-term trend toward standardization. #OpenScience #DiachronicAnalysis #NLP
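A minimal sketch of what such LLM-driven structured-metadata extraction could look like. The prompt wording, field names, and mocked model reply below are hypothetical illustrations, not the authors' actual pipeline or schema:

```python
import json

# Hypothetical prompt template asking a model for structured metadata;
# the JSON keys here are illustrative, not the study's real schema.
PROMPT_TEMPLATE = (
    "Read the following text from the Royal Society Corpus and return a "
    "JSON object with the keys 'discipline' and 'chemical_substances'.\n\n"
    "Text:\n{text}"
)

def build_prompt(text: str) -> str:
    return PROMPT_TEMPLATE.format(text=text)

def parse_metadata(model_output: str) -> dict:
    """Parse the model's JSON reply, tolerating surrounding prose."""
    start = model_output.find("{")
    end = model_output.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in model output")
    return json.loads(model_output[start:end + 1])

# Example with a mocked model reply (no API call made here):
reply = ('Here is the metadata: {"discipline": "Chemistry", '
         '"chemical_substances": ["oxygen"]}')
meta = parse_metadata(reply)
```

At 47k+ texts, the practical work is mostly in validating and normalizing such replies; the tolerant parsing above is one common defensive choice.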

The widespread belief that #LLMs will replace all of our jobs is the strongest example of the Dunning-Kruger effect in my lifetime.

This effect normally applies to a small number of rather clueless individuals. But it can also grip a society at large when we all extrapolate from a technology that no one properly understands.

This is one reason why UX people tend to be so reluctant to drink the Kool-Aid: we've been here many, many times.

Posting another #Introduction - plz boost far and/or wide!

#French-Born, #London-Based CompSci Teacher/Education PhD

#Education #Research #Phd, #BCS #Computing #Teacher #CCT
#CSEd #Programming #BCS
#ActuallyAutistic
#ActuallyADHD
I live with #MultipleSclerosis
#Zen / #Nonduality #Buddhist, weirdly into #Jung
#Research topics:
- #EdAI / #AIEd - #LLMs in #Education
- #CriticalStudies of #EdTech
- #Neurodiversity in #Education, and the experience of ND educators.

Proof or bluff? Evaluating LLMs on 2025 USA math olympiad. ~ Ivo Petrov et al. arxiv.org/abs/2503.21934 #LLMs #Math

arXiv.org — Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad
Recent math benchmarks for large language models (LLMs) such as MathArena indicate that state-of-the-art reasoning models achieve impressive performance on mathematical competitions like AIME, with the leading model, o3-mini, achieving scores comparable to top human competitors. However, these benchmarks evaluate models solely based on final numerical answers, neglecting rigorous reasoning and proof generation which are essential for real-world mathematical tasks. To address this, we introduce the first comprehensive evaluation of full-solution reasoning for challenging mathematical problems. Using expert human annotators, we evaluated several state-of-the-art reasoning models on the six problems from the 2025 USAMO within hours of their release. Our results reveal that all tested models struggled significantly, achieving less than 5% on average. Through detailed analysis of reasoning traces, we identify the most common failure modes and find several unwanted artifacts arising from the optimization strategies employed during model training. Overall, our results suggest that current LLMs are inadequate for rigorous mathematical reasoning tasks, highlighting the need for substantial improvements in reasoning and proof generation capabilities.
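To make the "less than 5% on average" metric concrete: USAMO problems are traditionally graded out of 7 points each, six problems per contest. The per-problem scores below are made up purely to illustrate how such a percentage is computed, not the paper's actual data:

```python
# Hypothetical partial-credit awards on the six 2025 USAMO problems,
# each graded out of 7 points (standard olympiad scale).
scores = [0, 1, 0, 0, 1, 0]
percent = 100 * sum(scores) / (7 * len(scores))
# 2/42 ≈ 4.8%, i.e. under the 5% average the paper reports.
```

The point of full-solution grading is that a model can emit the right final number while earning near-zero points for the proof itself, which answer-only benchmarks like MathArena cannot detect.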

🔴 💻 **Are chatbots reliable text annotators? Sometimes**

“_Given the unreliable performance of ChatGPT and the significant challenges it poses to Open Science, we advise caution when using ChatGPT for substantive text annotation tasks._”

Ross Deans Kristensen-McLachlan, Miceal Canavan, Marton Kárdos, Mia Jacobsen, Lene Aarøe, Are chatbots reliable text annotators? Sometimes, PNAS Nexus, Volume 4, Issue 4, April 2025, pgaf069, doi.org/10.1093/pnasnexus/pgaf.

#OpenAccess #OA #Article #AI #ArtificialIntelligence #LargeLanguageModels #LLMS #Chatbots #Technology #Tech #Data #Annotation #Academia #Academics @ai
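A toy reliability check of the kind the paper motivates: comparing chatbot labels against human gold labels. The labels below are invented, and simple percent agreement is only the crudest such measure (the study uses more rigorous analysis), but it illustrates the idea:

```python
# Hypothetical gold-standard human labels vs. chatbot annotations
# for a three-class sentiment task.
human   = ["pos", "neg", "pos", "neu", "neg", "pos"]
chatbot = ["pos", "pos", "pos", "neu", "neg", "neg"]

# Percent agreement: fraction of items where the two label sets match.
agreement = sum(h == c for h, c in zip(human, chatbot)) / len(human)
```

In practice, chance-corrected statistics (e.g. Krippendorff's alpha) and repeated runs are needed before trusting a chatbot annotator, since raw agreement overstates reliability on skewed label distributions.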

STP: Self-play LLM theorem provers with iterative conjecturing and proving. ~ Kefan Dong, Tengyu Ma. arxiv.org/abs/2502.00212 #AI #LLMs #ITP #LeanProver

arXiv.org — STP: Self-play LLM Theorem Provers with Iterative Conjecturing and Proving
A fundamental challenge in formal theorem proving by LLMs is the lack of high-quality training data. Although reinforcement learning or expert iteration partially mitigates this issue by alternating between LLM generating proofs and finetuning them on correctly generated ones, performance quickly plateaus due to the scarcity of correct proofs (sparse rewards). To keep improving the models with limited data, we draw inspiration from mathematicians, who continuously develop new results, partly by proposing novel conjectures or exercises (which are often variants of known results) and attempting to solve them. We design the Self-play Theorem Prover (STP) that simultaneously takes on two roles, conjecturer and prover, each providing training signals to the other. The conjecturer is trained iteratively on previously generated conjectures that are barely provable by the current prover, which incentivizes it to generate increasingly challenging conjectures over time. The prover attempts to prove the conjectures with standard expert iteration. We evaluate STP with both Lean and Isabelle formal verifiers. With 51.3 billion tokens generated during the training in Lean, STP proves 28.5% of the statements in the LeanWorkbook dataset, doubling the previous best result of 13.2% achieved through expert iteration. The final model achieves state-of-the-art performance among whole-proof generation methods on miniF2F-test (65.0%, pass@3200), Proofnet-test (23.9%, pass@3200) and PutnamBench (8/644, pass@3200). We release our code, model, and dataset in this URL: https://github.com/kfdong/STP.
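An illustrative sketch of the conjecturer/prover loop the abstract describes, with toy stand-ins (integers for conjectures, a probabilistic "proof attempt"). This only models the shape of the training dynamic, not the authors' actual system:

```python
import random

def prover_attempt(difficulty: int, skill: int, rng: random.Random) -> bool:
    """Toy proof attempt: success probability shrinks as the conjecture's
    difficulty exceeds the prover's current skill."""
    gap = difficulty - skill
    return rng.random() < max(0.05, 1.0 - 0.5 * gap)

def self_play(rounds: int, seed: int = 0) -> int:
    rng = random.Random(seed)
    skill = 0
    for _ in range(rounds):
        # The conjecturer proposes a problem just beyond current ability
        # ("barely provable"), which keeps the training signal useful.
        conjecture = skill + 1
        if prover_attempt(conjecture, skill, rng):
            # Successful proofs become training data, so the prover
            # improves and the conjecturer's target moves with it.
            skill += 1
    return skill

final_skill = self_play(100)
```

The key design choice mirrored here is that neither role trains on a fixed dataset: the conjecturer tracks the prover's frontier, which is how STP avoids the sparse-reward plateau of plain expert iteration.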