Understanding spam with PostHog and Discourse

jericson · September 13, 2024, 5:59am

The day after posting about spam on this site, I got a flood (relative to a site this size) of spam. It’s exciting to get more data and I now feel comfortable creating spam:

Create: 'spam'

Or, to be more precise, the spam tag. Creating actual spam is harder because Discourse does some things to prevent it. For instance, min_first_post_typing_time:

Minimum amount of time in milliseconds a user must type during first post, if threshold is not met post will automatically enter the needs approval queue. Set to 0 to disable (not recommended)

Using PostHog, I was able to watch this happen in a session recording and here’s what one of my spammers saw:

Post Needs Approval We've received your new post but it needs to be approved by a moderator before it will appear. Please be patient. You have 1 post pending

That turned out to be enough to deter this particular spammer. It also explains why successful spammers don’t paste in their payload immediately. They need to take a bit of time typing something in order to avoid being blocked. It seems some know this trick, but others don’t. Increasing this value can dramatically reduce spam without significantly impacting new contributors. People who actually want to contribute won’t be deterred by that message if their post is approved fairly quickly. That’s if they run into the roadblock at all. Contributors generally take their time when making their entrance to a community.
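To make the rule concrete, here is a minimal sketch of the check that setting describes. It is not Discourse's actual code (Discourse is written in Ruby). The function name, its arguments, and the 3000 ms threshold are assumptions chosen for illustration.

```python
def needs_approval(typing_duration_ms: int,
                   is_first_post: bool,
                   min_first_post_typing_time: int = 3000) -> bool:
    """Return True if a post should go to the approval queue.

    Hypothetical re-statement of the setting's description, not Discourse's
    implementation. The 3000 ms default is an illustrative value.
    """
    # A threshold of 0 disables the check entirely.
    if min_first_post_typing_time <= 0:
        return False
    # The rule only applies to a user's first post.
    if not is_first_post:
        return False
    # Too little time spent typing looks like a pasted payload.
    return typing_duration_ms < min_first_post_typing_time


# A pasted payload: almost no typing time on a first post.
print(needs_approval(typing_duration_ms=200, is_first_post=True))
# A contributor who spent 45 seconds composing a first post.
print(needs_approval(typing_duration_ms=45_000, is_first_post=True))
```

Raising the threshold only widens the set of first posts that get queued. Nothing is rejected outright, which is why a slow-typing contributor loses little.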

So how did I find these session recordings in PostHog? I followed the IPs. When an admin deletes a user in Discourse, the logs (/admin/logs/staff_action_logs) show the IP the user used.[1] I’ve edited the IP to ***.***.***.*** to protect the guilty.

Over in PostHog, I create a new filter with that IP.

Then I can add all the IPs I collected to look for patterns. What URLs did they visit? What type of device, browser and OS did they use? Did they come direct or via a link? And, of course, there’s a list of session recordings if you have that feature turned on.
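I did this in the PostHog interface, but the same questions can be asked of exported events. The sketch below assumes each event is a dictionary with a `properties` mapping that uses PostHog-style names such as `$ip`, `$os` and `$referrer`. Check those names against your own export, since the event shape here is an assumption. The IPs and URLs are placeholders.

```python
from collections import Counter, defaultdict

# Placeholder addresses from the documentation ranges, not real spammer IPs.
SUSPECT_IPS = {"203.0.113.7", "198.51.100.23"}

# Assumed property names; verify them against your own exported events.
PROPERTIES = ("$current_url", "$device_type", "$browser", "$os", "$referrer")


def summarize(events, suspect_ips):
    """Count the values of each property of interest, per suspect IP."""
    summary = defaultdict(lambda: defaultdict(Counter))
    for event in events:
        props = event.get("properties", {})
        ip = props.get("$ip")
        if ip not in suspect_ips:
            continue  # ignore traffic from everyone else
        for name in PROPERTIES:
            summary[ip][name][props.get(name, "(missing)")] += 1
    return summary


def make_event(ip, url, os_name):
    """Build a hypothetical event for the example below."""
    return {"properties": {
        "$ip": ip, "$current_url": url, "$device_type": "Mobile",
        "$browser": "Chrome", "$os": os_name, "$referrer": "$direct",
    }}


events = [
    make_event("203.0.113.7", "https://forum.example/t/deleted-topic/12", "Android"),
    make_event("203.0.113.7", "https://forum.example/signup", "Android"),
    make_event("192.0.2.50", "https://forum.example/", "Windows"),
]

result = summarize(events, SUSPECT_IPS)
for ip, counts in result.items():
    print(ip, dict(counts["$os"]), dict(counts["$referrer"]))
```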

So what did I learn? Well, it turns out all the spammers after the initial one visited my site directly by entering the URL of one of the spam topics I deleted. The only way that could happen is if it was the same person or if the initial spammer passed the link to his colleagues. My guess is that these were different people, because some knew to take some time to type something first and others hit the block by pasting in the payload. Also, the IPs were from different cities in the same country.
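The "landed directly on a deleted spam topic" pattern can be expressed as a small check over one session's events. This is a sketch under assumptions: events are dictionaries with `event`, `timestamp` (ISO 8601) and `properties` keys, pageviews are named `$pageview`, and a direct visit has `$referrer` set to `$direct`. The function name and the URLs are hypothetical.

```python
from datetime import datetime


def landed_directly_on(session_events, spam_topic_urls):
    """True if the session's first pageview is a known spam topic URL
    reached with no referrer (assumed to be recorded as "$direct")."""
    pageviews = [e for e in session_events if e.get("event") == "$pageview"]
    if not pageviews:
        return False
    # The earliest pageview is the entry point of the session.
    first = min(pageviews, key=lambda e: datetime.fromisoformat(e["timestamp"]))
    props = first.get("properties", {})
    return (props.get("$current_url") in spam_topic_urls
            and props.get("$referrer") == "$direct")


# Hypothetical URLs of topics that were deleted as spam.
SPAM_TOPIC_URLS = {"https://forum.example/t/deleted-topic/12"}

session = [
    {"event": "$pageview", "timestamp": "2024-09-12T03:10:05+00:00",
     "properties": {"$current_url": "https://forum.example/signup",
                    "$referrer": "https://forum.example/t/deleted-topic/12"}},
    {"event": "$pageview", "timestamp": "2024-09-12T03:09:40+00:00",
     "properties": {"$current_url": "https://forum.example/t/deleted-topic/12",
                    "$referrer": "$direct"}},
]

print(landed_directly_on(session, SPAM_TOPIC_URLS))
```

A session that arrives this way cannot have found the URL on the site itself once the topic is deleted, which is what points to the link being shared or reused.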

They all used an Android phone to log in. Some created their account with Google credentials and others used email and password. Successful spammers posted 3 topics with one reply. Each spammer had a different payload and none seemed at all interesting. Each took longer making spam than I took deleting it, but thanks to time zones, there was spam on my site for ~4 hours before I noticed.

[1] I’m not sure without looking it up whether this is the IP used at registration or the last IP used. For these spam incidents, it doesn’t matter one way or the other.