101010.pl is one of the many independent Mastodon servers you can use to participate in the fediverse.
101010.pl is the oldest Polish Mastodon server. We support posts of up to 2048 characters.

Server stats: 486 active users

#crawling

Frontend Dogma<p>What Is llms.txt, and Should You Care About It?, by <span class="h-card" translate="no"><a href="https://mastodon.social/@ahrefs" class="u-url mention" rel="nofollow noopener" target="_blank">@<span>ahrefs</span></a></span>:</p><p><a href="https://ahrefs.com/blog/what-is-llms-txt/" rel="nofollow noopener" translate="no" target="_blank"><span class="invisible">https://</span><span class="ellipsis">ahrefs.com/blog/what-is-llms-t</span><span class="invisible">xt/</span></a></p><p><a href="https://mas.to/tags/ai" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>ai</span></a> <a href="https://mas.to/tags/crawling" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>crawling</span></a> <a href="https://mas.to/tags/robotstxt" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>robotstxt</span></a></p>
Frontend Dogma<p>Meet LLMs.txt, a Proposed Standard for AI Website Content Crawling, by @searchengineland.bsky.social:</p><p><a href="https://searchengineland.com/llms-txt-proposed-standard-453676" rel="nofollow noopener" translate="no" target="_blank"><span class="invisible">https://</span><span class="ellipsis">searchengineland.com/llms-txt-</span><span class="invisible">proposed-standard-453676</span></a></p><p><a href="https://mas.to/tags/ai" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>ai</span></a> <a href="https://mas.to/tags/crawling" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>crawling</span></a> <a href="https://mas.to/tags/scraping" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>scraping</span></a> <a href="https://mas.to/tags/robotstxt" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>robotstxt</span></a></p>
Frontend Dogma<p>Poisoning Well, by <span class="h-card" translate="no"><a href="https://front-end.social/@heydon" class="u-url mention" rel="nofollow noopener" target="_blank">@<span>heydon</span></a></span>:</p><p><a href="https://heydonworks.com/article/poisoning-well/" rel="nofollow noopener" translate="no" target="_blank"><span class="invisible">https://</span><span class="ellipsis">heydonworks.com/article/poison</span><span class="invisible">ing-well/</span></a></p><p><a href="https://mas.to/tags/ai" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>ai</span></a> <a href="https://mas.to/tags/crawling" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>crawling</span></a> <a href="https://mas.to/tags/robotstxt" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>robotstxt</span></a> <a href="https://mas.to/tags/content" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>content</span></a></p>
Frontend Dogma<p>Please Stop Externalizing Your Costs Directly Into My Face, by <span class="h-card" translate="no"><a href="https://cmpwn.com/@sir" class="u-url mention" rel="nofollow noopener" target="_blank">@<span>sir</span></a></span>:</p><p><a href="https://drewdevault.com/2025/03/17/2025-03-17-Stop-externalizing-your-costs-on-me.html" rel="nofollow noopener" translate="no" target="_blank"><span class="invisible">https://</span><span class="ellipsis">drewdevault.com/2025/03/17/202</span><span class="invisible">5-03-17-Stop-externalizing-your-costs-on-me.html</span></a></p><p><a href="https://mas.to/tags/ai" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>ai</span></a> <a href="https://mas.to/tags/crawling" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>crawling</span></a> <a href="https://mas.to/tags/traffic" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>traffic</span></a> <a href="https://mas.to/tags/economics" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>economics</span></a></p>
DaLetra Français<p>Discover the lyrics of the song “Crawling” by Linkin Park<br><a href="https://flipboard.social/tags/LinkinPark" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>LinkinPark</span></a> <a href="https://flipboard.social/tags/Crawling" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Crawling</span></a><br><a href="https://daletra.art/linkin-park/paroles/crawling.html" rel="nofollow noopener" translate="no" target="_blank"><span class="invisible">https://</span><span class="ellipsis">daletra.art/linkin-park/parole</span><span class="invisible">s/crawling.html</span></a></p>
Tobias Köngeter<p>Who is familiar with <a href="https://sueden.social/tags/website" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>website</span></a> <a href="https://sueden.social/tags/tracking" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>tracking</span></a>? My website www.whatcolor.is currently has around 85,000 visitors per day, with most of them accessing all resources such as images, but also JSON documents and InDesign documents. That looks a lot like <a href="https://sueden.social/tags/crawling" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>crawling</span></a> to me. I use <a href="https://sueden.social/tags/Matomo" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Matomo</span></a>, which has been configured in accordance with German data protection regulations, so I don't have a lot of information about these visitors. How can I find out more about these visitors? What could they be?</p><p><a href="https://sueden.social/tags/helpwanted" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>helpwanted</span></a></p>
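One way to learn more about such visitors without touching the Matomo setup is to look at the web server's own access log. The following is a minimal sketch, assuming the server writes the common "combined" log format; the sample lines, IP addresses, and user agent strings are invented for illustration and do not come from the post.

```python
import re
from collections import Counter

# Regex for the "combined" access log format (an assumption about the server setup).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def summarize(lines):
    """Count requests per client IP and per user agent string."""
    by_ip = Counter()
    by_agent = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # skip lines that do not follow the assumed format
        by_ip[match["ip"]] += 1
        by_agent[match["agent"]] += 1
    return by_ip, by_agent

# Hypothetical sample lines using documentation IP ranges.
SAMPLE = [
    '203.0.113.7 - - [01/Jan/2025:10:00:00 +0000] "GET /colors.json HTTP/1.1" 200 512 "-" "ExampleBot/1.0"',
    '203.0.113.7 - - [01/Jan/2025:10:00:01 +0000] "GET /palette.indd HTTP/1.1" 200 90210 "-" "ExampleBot/1.0"',
    '198.51.100.23 - - [01/Jan/2025:10:00:02 +0000] "GET / HTTP/1.1" 200 2048 "-" "Mozilla/5.0"',
    'this line is not a log entry',
]

by_ip, by_agent = summarize(SAMPLE)
for ip, count in by_ip.most_common(5):
    print(ip, count)
for agent, count in by_agent.most_common(5):
    print(agent, count)
```

Clients that request every JSON and InDesign file in quick succession, or a few IP ranges that dominate the counts, are a strong hint of automated crawling rather than human visitors.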
DaLetra<p>Check out the lyrics of the song “Crawling” by Linkin Park<br><a href="https://flipboard.social/tags/LinkinPark" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>LinkinPark</span></a> <a href="https://flipboard.social/tags/Crawling" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Crawling</span></a><br><a href="https://daletra.com.br/linkin-park/letra/crawling.html" rel="nofollow noopener" translate="no" target="_blank"><span class="invisible">https://</span><span class="ellipsis">daletra.com.br/linkin-park/let</span><span class="invisible">ra/crawling.html</span></a></p>
Max Resing<p>It looks like LLM-producing companies that are massively <a href="https://infosec.exchange/tags/crawling" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>crawling</span></a> the <a href="https://infosec.exchange/tags/web" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>web</span></a> require the owners of a website to take action to opt out. Although I am not intrinsically against <a href="https://infosec.exchange/tags/generativeai" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>generativeai</span></a> and the acquisition of <a href="https://infosec.exchange/tags/opendata" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>opendata</span></a>, reading about hundreds of dollars of rising <a href="https://infosec.exchange/tags/cloud" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>cloud</span></a> costs for hobby projects is quite concerning. How is it accepted that hypergiants skyrocket the costs of tightly budgeted projects through massive spikes in egress traffic and increased processing requirements? Projects that run on a shoestring budget and are operated by volunteers who dedicated hundreds of hours without any reward other than believing in their mission?</p><p>I am mostly concerned about the default of opting out. Are the owners of those projects required to take action? Seriously? 
As an <a href="https://infosec.exchange/tags/operator" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>operator</span></a>, it would be my responsibility to methodically work my way through the crawling documentation of the hundreds of <a href="https://infosec.exchange/tags/LLM" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>LLM</span></a> <a href="https://infosec.exchange/tags/web" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>web</span></a> <a href="https://infosec.exchange/tags/crawlers" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>crawlers</span></a>? I am the one responsible for configuring a unique crawling specification in my robots.txt because hypergiants make it inherently hard to have generic <a href="https://infosec.exchange/tags/opt" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>opt</span></a>-out configurations that tackle LLM projects specifically?</p><p>I refuse to accept that this is our new norm. A norm in which hypergiants not only methodically exploit the work of thousands of individuals for their own benefit without returning a penny, but in which the resource owner is also required to prevent these crawlers from skyrocketing their own operational costs.</p><p>We require a new <a href="https://infosec.exchange/tags/opt" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>opt</span></a>-in. Often, public and open projects are keen to share their data. They just don't like the idea of carrying the unpredictable, multitudinous financial burden of sharing the data without notice from said crawlers. Even <a href="https://infosec.exchange/tags/CommonCrawl" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>CommonCrawl</span></a> has fail-safe mechanisms to reduce the burden on website owners. 
Why are LLM crawlers above the guidelines of good <a href="https://infosec.exchange/tags/Internet" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Internet</span></a> citizenship?</p><p>To counter the most common argument already: Yes, you can deny-by-default in your robots.txt, but that excludes any non-mainstream browser, too.</p><p>Some concerning <a href="https://infosec.exchange/tags/news" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>news</span></a> articles on the topic:</p><ul><li><a href="https://archive.is/nQ6Gk" rel="nofollow noopener" translate="no" target="_blank"><span class="invisible">https://</span><span class="">archive.is/nQ6Gk</span><span class="invisible"></span></a></li><li><a href="https://archive.is/CRwVs" rel="nofollow noopener" translate="no" target="_blank"><span class="invisible">https://</span><span class="">archive.is/CRwVs</span><span class="invisible"></span></a></li></ul><p><a href="https://infosec.exchange/tags/webcrawling" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>webcrawling</span></a> <a href="https://infosec.exchange/tags/crawler" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>crawler</span></a> <a href="https://infosec.exchange/tags/web" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>web</span></a> <a href="https://infosec.exchange/tags/opensource" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>opensource</span></a></p>
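The trade-off described in the post above, per-crawler opt-outs versus deny-by-default, can be illustrated with Python's standard `urllib.robotparser`. This is only a sketch: the two robots.txt bodies are illustrative, the `meta-externalagent` token and the `SomeHobbyCrawler` name are assumptions, and robots.txt is a voluntary convention that only well-behaved crawlers honor.

```python
from urllib.robotparser import RobotFileParser

# Opt-out per crawler: every LLM crawler has to be named individually.
PER_CRAWLER_OPT_OUT = """\
User-agent: GPTBot
Disallow: /

User-agent: meta-externalagent
Disallow: /

User-agent: *
Allow: /
"""

# Deny-by-default: one rule, but it also shuts out every other crawler.
DENY_BY_DEFAULT = """\
User-agent: *
Disallow: /
"""

def is_allowed(robots_txt, user_agent, url="https://example.org/data/page.html"):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

for agent in ("GPTBot", "meta-externalagent", "SomeHobbyCrawler"):
    print(agent,
          "| per-crawler opt-out:", is_allowed(PER_CRAWLER_OPT_OUT, agent),
          "| deny-by-default:", is_allowed(DENY_BY_DEFAULT, agent))
```

With the first file, any crawler that is not listed by name stays allowed, which is why the list has to be maintained by hand. With the second, nothing needs to be maintained, but every compliant crawler is turned away, including the ones the operator might welcome.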
Kevin Karhan :verified:<p><span class="h-card" translate="no"><a href="https://mastodon.online/@zdl" class="u-url mention" rel="nofollow noopener" target="_blank">@<span>zdl</span></a></span> <span class="h-card" translate="no"><a href="https://hachyderm.io/@evacide" class="u-url mention" rel="nofollow noopener" target="_blank">@<span>evacide</span></a></span> that, and the fact that <span class="h-card" translate="no"><a href="https://mastodon.world/@signalapp" class="u-url mention" rel="nofollow noopener" target="_blank">@<span>signalapp</span></a></span> is incorporated in the <a href="https://infosec.space/tags/USA" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>USA</span></a>, making them susceptible to <a href="https://infosec.space/tags/GDPR" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>GDPR</span></a> &amp; <a href="https://infosec.space/tags/BDSG" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>BDSG</span></a>-incompatible <a href="https://infosec.space/tags/cyberfacist" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>cyberfacist</span></a> bs like <a href="https://infosec.space/tags/CloudAct" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>CloudAct</span></a>.</p><ul><li>If <a href="https://infosec.space/tags/Signal" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Signal</span></a> cared, they'd completely <a href="https://infosec.space/tags/OpenSource" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>OpenSource</span></a> <a href="https://infosec.space/tags/backend" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>backend</span></a> and <a href="https://infosec.space/tags/frontend" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>frontend</span></a> as well as <a href="https://infosec.space/tags/decentralize" class="mention hashtag" rel="nofollow 
noopener" target="_blank">#<span>decentralize</span></a> and refuse to collect any <a href="https://infosec.space/tags/PII" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>PII</span></a> (like <a href="https://infosec.space/tags/PhoneNumers" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>PhoneNumers</span></a>) <em>at all</em>!</li></ul><p>Remember: <a href="https://infosec.space/tags/KYC" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>KYC</span></a> <em>IS</em> THE ILLICIT ACTIVITY when it comes to <a href="https://infosec.space/tags/Communication" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Communication</span></a>!</p><ul><li>To me Signal has a stench like <a href="https://infosec.space/tags/CryptoAG" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>CryptoAG</span></a> (aka. <a href="https://infosec.space/tags/MINERVA" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>MINERVA</span></a> / <a href="https://infosec.space/tags/RUBIKON" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>RUBIKON</span></a>), <a href="https://infosec.space/tags/EncroChat" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>EncroChat</span></a> and especially <a href="https://infosec.space/tags/AN%C3%98M" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>ANØM</span></a> (aka. 
<a href="https://infosec.space/tags/OperationIronside" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>OperationIronside</span></a> / <a href="https://infosec.space/tags/OperationTr%C3%B8janShield" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>OperationTrøjanShield</span></a>)...</li></ul><p>Compare that to <span class="h-card" translate="no"><a href="https://monocles.social/@monocles" class="u-url mention" rel="nofollow noopener" target="_blank">@<span>monocles</span></a></span> / <a href="https://infosec.space/tags/monoclesChat" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>monoclesChat</span></a> which don't demand any PII or KYC and allow people to pay for their services with <a href="https://infosec.space/tags/Monero" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Monero</span></a> and <a href="https://infosec.space/tags/CashByMail" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>CashByMail</span></a> besides <a href="https://infosec.space/tags/SEPA" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>SEPA</span></a> <a href="https://infosec.space/tags/WireTransfer" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>WireTransfer</span></a>, <a href="https://infosec.space/tags/Stripe" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Stripe</span></a> &amp; <a href="https://infosec.space/tags/PayPal" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>PayPal</span></a> whilst supporting both decentralization (<a href="https://infosec.space/tags/XMPP" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>XMPP</span></a> is not a <a href="https://infosec.space/tags/SingleVendor" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>SingleVendor</span></a> / <a href="https://infosec.space/tags/SingleProvider" class="mention hashtag" 
rel="nofollow noopener" target="_blank">#<span>SingleProvider</span></a> solution!), implementing real <a href="https://infosec.space/tags/SelfCustody" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>SelfCustody</span></a> (<a href="https://infosec.space/tags/OMEMO" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>OMEMO</span></a>, <a href="https://infosec.space/tags/OTR" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>OTR</span></a> &amp; <a href="https://infosec.space/tags/PGP" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>PGP</span></a> are supported out of the box) for all the keys, and proper <a href="https://infosec.space/tags/Anonymitiy" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Anonymitiy</span></a> (using <span class="h-card" translate="no"><a href="https://mastodon.social/@torproject" class="u-url mention" rel="nofollow noopener" target="_blank">@<span>torproject</span></a></span> / <a href="https://infosec.space/tags/Tor" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Tor</span></a> &amp; <span class="h-card" translate="no"><a href="https://social.librem.one/@guardianproject" class="u-url mention" rel="nofollow noopener" target="_blank">@<span>guardianproject</span></a></span> <a href="https://infosec.space/tags/Orbot" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Orbot</span></a> for <a href="https://infosec.space/tags/privacy" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>privacy</span></a>), so in case they ever get a <em>duly submitted warrant</em> by a court they'd have to comply with, they'll most likely have no data whatsoever on clients that could allow identification.</p><ul><li>And that <em>is</em> a good thing, because whilst <em>very unlikely</em>, one cannot exclude the non-zero chance of e.g. 
<a href="https://infosec.space/tags/MLAT" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>MLAT</span></a>|s being filed with knowingly false information by 3rd countries.</li></ul><p>Also having no PII is a matter of reducing <a href="https://infosec.space/tags/liability" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>liability</span></a> in the sense of <a href="https://infosec.space/tags/DataProtection" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>DataProtection</span></a>: All data requested by <a href="https://infosec.space/tags/monocles" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>monocles</span></a> is the bare minimum mandated for <a href="https://infosec.space/tags/accounting" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>accounting</span></a> (i.e. only linking a payment like a <a href="https://infosec.space/tags/TxID" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>TxID</span></a> / Transaction-ID to an account and then adding up validity/activation period).</p><ul><li>And since running a <a href="https://infosec.space/tags/Service" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Service</span></a> <em>costs money</em>, the low <a href="https://infosec.space/tags/subscription" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>subscription</span></a> to their <a href="https://infosec.space/tags/Services" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Services</span></a> makes them independent from <a href="https://infosec.space/tags/ads" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>ads</span></a>, <a href="https://infosec.space/tags/crawling" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>crawling</span></a> / <a href="https://infosec.space/tags/espionage" class="mention hashtag" 
rel="nofollow noopener" target="_blank">#<span>espionage</span></a> against <a href="https://infosec.space/tags/customers" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>customers</span></a> and depending on <a href="https://infosec.space/tags/grants" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>grants</span></a> and <a href="https://infosec.space/tags/donations" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>donations</span></a> to keep the lights on, making it a <a href="https://infosec.space/tags/sustainable" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>sustainable</span></a> <a href="https://infosec.space/tags/business" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>business</span></a>...</li></ul>
Starbeamrainbowlabs<p>Implementation progress: got the crawler written &amp; integrated with the indexes + thumbnailer (+`.gitignore` support) but not tested. The vector search persistence system is truly shocking..... I think I need to back it with an `njodb` instance.</p><p>Next up:</p><p>- testing the crawler<br>- integrating with the rest of the app<br>- HTTP API<br>- starting work on the web interface</p><p>(see also &lt;<a href="https://www.npmjs.com/package/njodb" rel="nofollow noopener" translate="no" target="_blank"><span class="invisible">https://www.</span><span class="">npmjs.com/package/njodb</span><span class="invisible"></span></a>&gt;)</p><p><a href="https://fediscience.org/tags/OpenSource" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>OpenSource</span></a> <a href="https://fediscience.org/tags/Eventually" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Eventually</span></a> <a href="https://fediscience.org/tags/OnceIveFinishedAMinimumWorkingVersion" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>OnceIveFinishedAMinimumWorkingVersion</span></a> <a href="https://fediscience.org/tags/Javascript" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Javascript</span></a> <a href="https://fediscience.org/tags/Crawling" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Crawling</span></a> <a href="https://fediscience.org/tags/Programming" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Programming</span></a> <a href="https://fediscience.org/tags/Chatter" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Chatter</span></a> <a href="https://fediscience.org/tags/ProgressUpdate" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>ProgressUpdate</span></a> <a href="https://fediscience.org/tags/microblog" class="mention hashtag" rel="nofollow noopener" 
target="_blank">#<span>microblog</span></a> <a href="https://fediscience.org/tags/IsThereATagForPhotoManagement" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>IsThereATagForPhotoManagement</span></a> <a href="https://fediscience.org/tags/PhotoManagement" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>PhotoManagement</span></a> <a href="https://fediscience.org/tags/WillTagWithPhotographyOnceTheresSomethingToSee" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>WillTagWithPhotographyOnceTheresSomethingToSee</span></a></p>
DaLetra Français<p>See the lyrics of the song “Crawling” by Linkin Park<br><a href="https://flipboard.social/tags/LinkinPark" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>LinkinPark</span></a> <a href="https://flipboard.social/tags/Crawling" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Crawling</span></a><br><a href="https://daletra.art/linkin-park/paroles/crawling.html" rel="nofollow noopener" translate="no" target="_blank"><span class="invisible">https://</span><span class="ellipsis">daletra.art/linkin-park/parole</span><span class="invisible">s/crawling.html</span></a></p>
8 Ball System<p>Hey, a little <a href="https://8ballsystem.gay/tags/tip" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>tip</span></a> for you all: if you want to <a href="https://8ballsystem.gay/tags/block" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>block</span></a> <a href="https://8ballsystem.gay/tags/openAI" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>openAI</span></a> from <a href="https://8ballsystem.gay/tags/crawling" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>crawling</span></a> your <a href="https://8ballsystem.gay/tags/server" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>server</span></a> or <a href="https://8ballsystem.gay/tags/website" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>website</span></a>, they publish a list of IP addresses that they use for crawling.</p><p><a href="https://platform.openai.com/docs/bots" rel="nofollow noopener" translate="no" target="_blank"><span class="invisible">https://</span><span class="">platform.openai.com/docs/bots</span><span class="invisible"></span></a></p><p>I plan on deploying firewall blocks on all these IP addresses later!</p><p>If you know the IPs that <a href="https://8ballsystem.gay/tags/meta" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>meta</span></a> and <a href="https://8ballsystem.gay/tags/google" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>google</span></a> use for crawling, please drop them here!!</p><p>Keep yourself and your stuff safe!</p>
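Turning a published list of crawler ranges into firewall blocks could look roughly like the sketch below. It assumes the ranges have already been copied out of the documentation as CIDR strings; the ranges used here are reserved documentation placeholders, not OpenAI's real addresses, and `drop_rules` is a hypothetical helper that only prints `iptables`/`ip6tables` commands rather than applying them.

```python
import ipaddress

def drop_rules(cidrs):
    """Build one DROP rule per CIDR range, validating each range first."""
    rules = []
    for cidr in cidrs:
        # strict=False normalizes input such as 192.0.2.5/24 to 192.0.2.0/24.
        network = ipaddress.ip_network(cidr, strict=False)
        tool = "iptables" if network.version == 4 else "ip6tables"
        rules.append(f"{tool} -A INPUT -s {network} -j DROP")
    return rules

# Placeholder ranges (documentation prefixes), standing in for a published list.
EXAMPLE_RANGES = ["192.0.2.0/24", "198.51.100.5/24", "2001:db8::/32"]

for rule in drop_rules(EXAMPLE_RANGES):
    print(rule)
```

Printing the commands first makes it possible to review them before anything is loaded into the firewall. Published ranges can change, so a list like this would need to be refreshed from time to time.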
DaLetra English<p>Lyrics for the song “Crawling” by Linkin Park<br><a href="https://flipboard.social/tags/LinkinPark" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>LinkinPark</span></a> <a href="https://flipboard.social/tags/Crawling" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Crawling</span></a><br><a href="https://daletra.com/linkin-park/lyrics/crawling.html" rel="nofollow noopener" translate="no" target="_blank"><span class="invisible">https://</span><span class="ellipsis">daletra.com/linkin-park/lyrics</span><span class="invisible">/crawling.html</span></a></p>
panigrc<p><span class="h-card" translate="no"><a href="https://phpc.social/@otsch" class="u-url mention" rel="nofollow noopener" target="_blank">@<span>otsch</span></a></span> Thanks for your contribution! 💙 </p><p>The most difficult part about crawling is <a href="https://mastodon.social/tags/cloudflare" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>cloudflare</span></a> 😡 <br>The only way I have found to bypass it, although it doesn't work 100% of the time, is <a href="https://mastodon.social/tags/flaresolverr" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>flaresolverr</span></a> <a href="https://github.com/FlareSolverr/FlareSolverr" rel="nofollow noopener" translate="no" target="_blank"><span class="invisible">https://</span><span class="ellipsis">github.com/FlareSolverr/FlareS</span><span class="invisible">olverr</span></a></p><p>Since you started supporting structured data, here is an idea for you: start supporting <a href="https://mastodon.social/tags/microformats" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>microformats</span></a> <a href="https://microformats.org/" rel="nofollow noopener" translate="no" target="_blank"><span class="invisible">https://</span><span class="">microformats.org/</span><span class="invisible"></span></a></p><p><a href="https://mastodon.social/tags/php" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>php</span></a> <a href="https://mastodon.social/tags/webscraping" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>webscraping</span></a> <a href="https://mastodon.social/tags/crawling" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>crawling</span></a></p>
DaLetra English<p>Check out the lyrics for the song “Crawling” by Linkin Park<br><a href="https://flipboard.social/tags/LinkinPark" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>LinkinPark</span></a> <a href="https://flipboard.social/tags/Crawling" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Crawling</span></a><br><a href="https://daletra.com/linkin-park/lyrics/crawling.html" rel="nofollow noopener" translate="no" target="_blank"><span class="invisible">https://</span><span class="ellipsis">daletra.com/linkin-park/lyrics</span><span class="invisible">/crawling.html</span></a></p>
Erik Jonker<p>An interesting fact about Meta External Agent (an AI crawler): as of August 2024, only about 2% of popular websites were blocking it, compared to 25% blocking OpenAI's GPTBot. Personally, I think trying to block AI web crawlers is useless. If a human is allowed to read a website, why shouldn't we let a machine do so? There are many legitimate uses for machines reading websites, not only AI.<br><a href="https://fortune.com/2024/08/20/meta-external-agent-new-web-crawler-bot-scrape-data-train-ai-models-llama/" rel="nofollow noopener" translate="no" target="_blank"><span class="invisible">https://</span><span class="ellipsis">fortune.com/2024/08/20/meta-ex</span><span class="invisible">ternal-agent-new-web-crawler-bot-scrape-data-train-ai-models-llama/</span></a><br><a href="https://mastodon.social/tags/metaexternalagent" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>metaexternalagent</span></a> <a href="https://mastodon.social/tags/meta" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>meta</span></a> <a href="https://mastodon.social/tags/crawling" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>crawling</span></a> <a href="https://mastodon.social/tags/AI" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>AI</span></a></p>
Josef 'Jeff' Sipek<p>Holy crap! <a href="https://mastodon.radio/tags/Huawei" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Huawei</span></a> is *very* aggressively <a href="https://mastodon.radio/tags/crawling" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>crawling</span></a> my <a href="https://mastodon.radio/tags/webserver" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>webserver</span></a>. As in, 6+ requests/sec for many hours coming from quite a few IPs. Here are the subnets I blocked which kills most of the traffic so far:</p><p>49.0.200.0/21<br>94.74.80.0/20<br>101.44.160.0/20<br>111.119.192.0/20<br>114.119.172.0/22<br>114.119.176.0/20<br>119.8.160.0/19<br>119.13.96.0/20<br>124.243.128.0/18<br>159.138.96.0/20<br>166.108.192.0/20<br>166.108.224.0/20<br>190.92.192.0/19</p><p>The user agent string is the typical "every browser in existence". <a href="https://mastodon.radio/tags/webcrawler" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>webcrawler</span></a></p>
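Because the user agent string is useless for identification here, matching on source address is the remaining option. A minimal sketch of checking whether a client IP falls inside the subnets listed in the post, using Python's standard `ipaddress` module; the `is_blocked` helper and the test addresses are hypothetical.

```python
import ipaddress

# Subnets copied from the post above.
BLOCKED_SUBNETS = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "49.0.200.0/21", "94.74.80.0/20", "101.44.160.0/20",
        "111.119.192.0/20", "114.119.172.0/22", "114.119.176.0/20",
        "119.8.160.0/19", "119.13.96.0/20", "124.243.128.0/18",
        "159.138.96.0/20", "166.108.192.0/20", "166.108.224.0/20",
        "190.92.192.0/19",
    )
]

def is_blocked(client_ip):
    """Return True if client_ip lies inside any of the blocked subnets."""
    address = ipaddress.ip_address(client_ip)
    return any(address in subnet for subnet in BLOCKED_SUBNETS)

for ip in ("114.119.173.5", "119.8.170.1", "8.8.8.8"):
    print(ip, is_blocked(ip))
```

The same check can be run over the addresses in an access log to estimate how much of the traffic the block list actually covers.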
Lyrical Garfield 🎶 :garfield:<p><a href="https://masto.ai/tags/Garfield" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Garfield</span></a> <a href="https://masto.ai/tags/music" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>music</span></a> <a href="https://masto.ai/tags/Crawling" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Crawling</span></a><br> <a href="https://masto.ai/tags/bot" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>bot</span></a> <a href="https://masto.ai/tags/lyrics" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>lyrics</span></a></p>