<p>Sascha Wolfer</p><p>Just published:</p><p>Supplementing CEFR-graded vocabulary lists for language learners by leveraging information on dictionary views, corpus frequency, part-of-speech, and polysemy</p><p>A machine-learning method to suggest word candidates for CEFR-graded vocabulary lists.</p><p><a href="https://doi.org/10.1057/s41599-025-05446-y" rel="nofollow noopener" translate="no" target="_blank"><span class="invisible">https://</span><span class="ellipsis">doi.org/10.1057/s41599-025-054</span><span class="invisible">46-y</span></a></p><p>- We compare 4 machine-learning algorithms: regression trees, ordinal logistic regression, random forests, &amp; naïve Bayes.<br>- All outperform a random baseline (approx. double the accuracy).<br>- Of these, we use random forests (2k trees) to impute the <a href="https://fediscience.org/tags/CEFR" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>CEFR</span></a> level of previously unlabeled words.</p><p><a href="https://fediscience.org/tags/linguistics" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>linguistics</span></a> <a href="https://fediscience.org/tags/CEFR" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>CEFR</span></a> <a href="https://fediscience.org/tags/frequency" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>frequency</span></a> <a href="https://fediscience.org/tags/dictionary" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>dictionary</span></a> <a href="https://fediscience.org/tags/LanguageLearning" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>LanguageLearning</span></a></p>
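<p>For readers who want a feel for the workflow in the bullet points, here is a rough sketch, not the code from the paper. It assumes Python with scikit-learn, a six-level CEFR scale, a uniform random guesser as the baseline, and entirely synthetic data; the feature columns (log dictionary views, log corpus frequency, a part-of-speech code, number of senses) are hypothetical stand-ins for the predictors named in the title. Only the 2,000 trees are taken from the post.</p>

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
LEVELS = np.array(["A1", "A2", "B1", "B2", "C1", "C2"])  # assumption: six levels

# Synthetic "words": a latent difficulty drives both the level and the features.
n_words = 600
difficulty = rng.uniform(0, 6, n_words)
levels = LEVELS[difficulty.astype(int)]

# Hypothetical predictors, loosely mirroring those named in the paper title.
log_views = 6 - difficulty + rng.normal(0, 0.7, n_words)   # dictionary views
log_freq = 6 - difficulty + rng.normal(0, 0.7, n_words)    # corpus frequency
pos_code = rng.integers(0, 4, n_words)                     # part of speech
n_senses = np.maximum(1, np.round(4 - difficulty / 2 + rng.normal(0, 1, n_words)))
X = np.column_stack([log_views, log_freq, pos_code, n_senses])

# Pretend the last 100 words have no CEFR label yet.
X_labeled, y_labeled = X[:500], levels[:500]
X_unlabeled = X[500:]

X_train, X_test, y_train, y_test = train_test_split(
    X_labeled, y_labeled, test_size=0.25, random_state=0, stratify=y_labeled
)

# Random baseline: guesses a level uniformly at random (an assumption here).
baseline = DummyClassifier(strategy="uniform", random_state=0).fit(X_train, y_train)
baseline_acc = baseline.score(X_test, y_test)

# Random forest with 2,000 trees, as mentioned in the post.
forest = RandomForestClassifier(n_estimators=2000, random_state=0, n_jobs=-1)
forest.fit(X_train, y_train)
forest_acc = forest.score(X_test, y_test)

print(f"baseline accuracy: {baseline_acc:.2f}")
print(f"forest accuracy:   {forest_acc:.2f}")

# Impute levels for the unlabeled words using all labeled data.
forest.fit(X_labeled, y_labeled)
imputed = forest.predict(X_unlabeled)
print(imputed[:10])
```

<p>The accuracies printed here only reflect the toy data and say nothing about the results reported in the paper.</p>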