AI's Actual Impact
Two years after OpenAI launched ChatGPT (then powered by GPT-3.5), humanity is not on the cusp of extinction, and Elon Musk seems more responsible for job loss than any AI agent.
However, ask any web administrator and you will learn that large language models are having a significant impact on the world wide web (or, for a less technical account, see Forbes articles on bots). At May First, a membership organization that has been supporting thousands of web sites for over 20 years, we have never seen anything like this before.
Wow, my site is getting really popular
It started in 2023. Web sites that performed quite well with a steady viewership started having traffic spikes. These were relatively easy to diagnose, since most of the spikes came from visitors that properly identified themselves as bots, allowing us to see that the big players - OpenAI, Bing, Google, Facebook - were increasing their efforts to scrape as much content from web sites as possible.
Small brochure sites were mostly unaffected because they could be scraped in a matter of minutes. But large sites with an archive of high quality, human-written content were getting hammered. Any web site with a search feature, a calendar, or any other interface that generates an effectively unlimited number of crawlable links was particularly vulnerable.
But hey, that’s what robots.txt is for, right? To tell robots to back off if you don’t want them scraping your site?
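For reference, robots.txt is purely advisory: a well-behaved crawler is supposed to fetch it, honor its Disallow rules, and respect any Crawl-delay before requesting pages. Here is a minimal sketch of that contract using Python's standard urllib.robotparser; the rules and crawler names are illustrative, not a real member site's file.

```python
from urllib import robotparser

# An illustrative robots.txt that asks one widely publicized AI crawler to
# stay out entirely and everyone else to avoid the expensive /search pages.
# Compliance is entirely voluntary - nothing enforces this on the server side.
rules = """
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /search
Crawl-delay: 10
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("GPTBot", "https://example.org/archive/page1"))    # False
print(rp.can_fetch("SomeBrowser", "https://example.org/search?q=x"))  # False
print(rp.can_fetch("SomeBrowser", "https://example.org/archive/"))    # True
print(rp.crawl_delay("SomeBrowser"))                                  # 10
```

The catch, of course, is that nothing at all happens to a crawler that never asks.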
Eventually, the cracks began to show. Bots were ignoring robots.txt (did they ever pay that much attention to it in the first place?). Rate limiting requests by user agent also began to fail. When you post a link on Facebook, a bot identifying itself as “facebookexternalhit” is invoked to preview the page so Facebook can show a picture and other metadata. We don’t want to rate limit that bot, right? Except Facebook also uses this same bot to scrape your site, often bringing it to its knees. And don’t get me started on TwitterBot.
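Rate limiting by user agent is simple enough while bots tell the truth about who they are. The idea looks roughly like this; it is a sketch only, and the limits, window, and agent names are made up rather than our actual configuration.

```python
import time
from collections import defaultdict, deque

# Per-user-agent rate limiting, sketched. Values are illustrative.
WINDOW_SECONDS = 60
LIMITS = {
    "facebookexternalhit": None,  # exempt so link previews keep working...
    "GPTBot": 10,                 # ...and a modest limit for known crawlers
    "default": 120,
}

_hits = defaultdict(deque)  # user agent -> timestamps of recent requests

def allow_request(user_agent: str, now: float | None = None) -> bool:
    """Return True if this request is under the per-agent limit."""
    now = time.time() if now is None else now
    limit = LIMITS.get(user_agent, LIMITS["default"])
    if limit is None:
        return True  # exempt agents are never throttled - the loophole
    window = _hits[user_agent]
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    if len(window) >= limit:
        return False
    window.append(now)
    return True
```

The scheme stands or falls on the user agent string meaning something, and every agent you exempt, like the link preview bot, becomes a free pass.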
Eventually, it became clear that the majority of the armies of bots scraping our sites have given up on identifying themselves as bots at all and are instead using user agents indistinguishable from regular browsers. Because the requests come from thousands of different IP addresses, it has become really hard to separate the real humans from the bots.
Now what?
So, no, unfortunately, your web site is not suddenly getting really popular. And, you are blessed with a whole new set of strategic decisions.
Fortunately, May First has undergone a major infrastructure transition, resulting in centralized logging of all web sites and a fleet of web proxy servers that intercept all web traffic. Centralized logging means we can analyze traffic and identify bots more easily, and a web proxy fleet allows us to more easily implement rules across all web sites.
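As a rough illustration of what centralized logging buys us, here is a toy analysis that tallies requests per client address in a combined-format access log and lists the heaviest hitters. The log path and threshold are placeholders, and our real tooling is considerably more involved.

```python
import re
from collections import Counter

LOG_PATH = "/var/log/proxy/access.log"   # hypothetical aggregated log
THRESHOLD = 10_000                        # requests per log file, made up

# Combined log format: ip - user [time] "request" status size "referer" "agent"
LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} \S+ "[^"]*" "([^"]*)"')

counts = Counter()
agents = {}
with open(LOG_PATH) as log:
    for raw in log:
        m = LINE.match(raw)
        if not m:
            continue
        ip, agent = m.groups()
        counts[ip] += 1
        agents[ip] = agent  # remember the last user agent seen per address

for ip, hits in counts.most_common(20):
    flag = "SUSPECT" if hits > THRESHOLD else ""
    print(f"{hits:8d}  {ip:39s}  {agents[ip][:60]}  {flag}")
```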
However, even with all of our latest changes and hours upon hours of work to keep out the bots, our members are facing some hard decisions about maintaining an open web.
One member of May First provides Google translations of their web site into every available language. But wow, that is now a disaster, because instead of having every bot under the sun scraping all 843 (a made up number) pieces of unique content on their site, the same bots are scraping 843 * (number of available languages) pieces of content. With well over 100 languages on offer, that turns 843 pages into more than 84,300 crawlable URLs. Should they stop providing this translation service in order to ensure people can access their site in the site’s primary language?
Should web sites turn off search features with drop-down category options, to prevent bots from systematically refreshing the search page with every possible combination of search terms?
Do we need to alter our calendar software to avoid providing endless links into the future (ok, that is an easy one)?
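Easy, at least in principle: the calendar only needs to stop generating an unbounded chain of “next month” links for crawlers to follow. A hypothetical sketch of the idea, with an invented cutoff and URL scheme:

```python
from datetime import date

# Hypothetical sketch: only offer "next month" navigation within a bounded
# horizon so a crawler following links eventually runs out of pages.
MAX_MONTHS_AHEAD = 12  # made-up cutoff

def next_month_link(year: int, month: int, today: date | None = None) -> str | None:
    """Return the URL of the next month's calendar page, or None past the horizon."""
    today = today or date.today()
    months_ahead = (year - today.year) * 12 + (month - today.month)
    if months_ahead >= MAX_MONTHS_AHEAD:
        return None  # the template renders no link, so there is nothing to crawl
    nxt_year, nxt_month = (year + 1, 1) if month == 12 else (year, month + 1)
    return f"/calendar/{nxt_year}/{nxt_month:02d}"
```

A crawler walking the navigation then runs out of pages within the horizon instead of paging into the indefinite future.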
What’s next?
Something has to change.
- Lock down web 2.0. Web 2.0 brought us wonderful dynamic web sites, which Drupal and WordPress and many other pieces of amazing software have supported for over a decade. This is the software that is getting bogged down by bots. Maybe we need to figure out a way to lock down the dynamic aspects of this software to logged-in users and provide static content for everyone else?
- Paywalls and accounts everywhere. There’s always been an amazing non-financial reward to providing a web site with high quality, movement-oriented content for free. It populates the search engines, provides links to inspiring and useful content in moments of crisis, and can galvanize movements. But those moments of triumph happen between long periods of hard labor that now seems mostly to feed capitalist AI scumbags. If we add a new set of expenses and labor to keep the sites running for this purpose, how sustainable is that? Will our treasure of free movement content have to move behind paywalls or logins? If we provide logins, will that keep the bots out, or just create a small hurdle for them to automate the account creation process? What happens when we can’t find this kind of content via search engines?
- Cutting deals. What if our movement content providers are forced to cut deals with the AI entrepreneurs, letting the paying scumbags fund the content creation? Eww. Enough said.
- Bot detection. Maybe we just need to get better at bot detection? This will surely be an arms race, but it would have some real side benefits. Bots have also been filling out our forms and populating our databases with spam, testing credit cards against our donation pages, conducting denial of service attacks, and committing all kinds of other irritating acts of vandalism. If we were better at stopping bots automatically, all of that would improve too. But what impact would it have on our web sites and the experience of using them? What about “good” bots (RSS feed readers, payment processors, web hooks, uptime monitors)? Will we cut the legs off any developer trying to automate something? A rough sketch of one detection heuristic follows this list.
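To make that last point concrete, here is the flavor of heuristic scoring that “getting better at bot detection” implies. The signals, weights, and threshold are invented for illustration; any real detector would be messier and locked in a constant arms race with the scrapers.

```python
# An illustrative scoring heuristic, not a production detector: each signal
# nudges a request toward "probably a bot". Weights and threshold are made up.
BOT_KEYWORDS = ("bot", "crawl", "spider", "scrape", "python-requests", "curl")

def bot_score(user_agent: str, requests_last_minute: int,
              fetched_robots_txt: bool, sent_accept_language: bool) -> int:
    score = 0
    ua = user_agent.lower()
    if any(word in ua for word in BOT_KEYWORDS):
        score += 3   # self-identified bots are the easy case
    if requests_last_minute > 60:
        score += 2   # faster than any human clicks
    if not sent_accept_language:
        score += 1   # real browsers almost always send this header
    if fetched_robots_txt:
        score -= 1   # polite crawlers at least looked at the rules
    return score

def looks_like_bot(*args, **kwargs) -> bool:
    return bot_score(*args, **kwargs) >= 3
```

Every signal here also penalizes some legitimate automation, which is exactly the dilemma for the “good” bots above.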
I’m not really sure where this is going, but it seems that the world wide web is about to head in a new direction.