The white paper for the Pile, a dataset used for many #AI projects (including key #opensource AI efforts), is a fascinating read.
It's managed by a mysterious group called the Eye. It includes things like pirated books and subtitles (for pirated movies), and it has a very interesting take on open science, strongly skewed towards research freedom and pragmatism in accessing and using data.
What I want to highlight here is that the dataset also includes a corpus of @europarl_en proceedings spanning over 20 years.
Anyone who follows the EP legislative process knows it's extremely #oldschool, with little sense that there are collective knowledge management tools that could significantly facilitate the process, make it more transparent and engaging, etc.
So it's interesting to see the EP's proceedings end up as a foundational bit for large language models.
I will now be imagining the outputs being just a bit tinged with European political talk.