How I broke a client's site without getting fired

It had to happen eventually, I suppose, but I broke a client site last week. It’s a private forum that uses WordPress as an SSO[1] server for Discourse. It also uses MemberPress to manage paid subscriptions. MemberPress, in turn, relies on Stripe to collect subscription fees. To explain how I broke the site and how I was able to recover, I’ll need to go into the details of the problem. Feel free to skip the next section if that doesn’t sound interesting.

My “solution”

When someone signs up for a subscription, they log into my client’s WordPress site, which uses MemberPress to communicate with Stripe. Stripe collects the subscription fee, and the user is assigned to a special group for subscribers on WordPress. When the user logs into Discourse, the WP Discourse plugin updates the user’s Discourse groups using code adapted from this answer. Critically, the user is added to a private group on Discourse that’s only available to subscribers.

So the system looks something like this:

flowchart LR
   Discourse <-->|WP Discourse|WordPress 
   WordPress <-->|MemberPress|Stripe

It works well enough by default. Whenever a user logs into Discourse, their groups are updated by WP Discourse. If their subscription has expired, they are removed from the subscribers-only group. So just before someone goes to look at the paywalled content, the system verifies they have paid up for it.

Recently, however, we added an option[2] for users to get an email whenever a subscriber-only newsletter comes out. Discourse doesn’t require people to log in before getting these emails[3] so it doesn’t check with WordPress to determine if the user has an active subscription. So users will get an email after their subscription has lapsed if they don’t log into Discourse.

Obviously that’s far from ideal. The good news is that there’s a way for WP Discourse to manage groups when subscriptions stop. All you need is a little bit of PHP code:

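use WPDiscourse\Utilities\Utilities as DiscourseUtilities;

// Runs when MemberPress fires its subscription-stopped event.
function mepr_capture_stopped_sub( $event ) {
    $subscription = $event->get_data();
    $user         = $subscription->user();

    // If the user no longer has an active membership, remove them from the Discourse group.
    if ( ! $user->has_cap( 'mepr-active', 'memberships:[comma-separated group IDs]' ) ) {
        // PHP property access uses ->, and the user ID property is ID ($user.id is string concatenation).
        $result = DiscourseUtilities::remove_user_from_discourse_group( $user->ID, 'Discourse_group_name' );
    }
}
add_action( 'mepr-event-subscription-stopped', 'mepr_capture_stopped_sub' );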

But where does that code go? It could be a WordPress plugin or functions.php. My client uses the My Custom Functions plugin, however, so that’s where I decided to put it.

How I screwed up

If you didn’t guess from the previous heading’s “scare quotes”, I didn’t actually solve the problem. I assumed that I could make changes to custom functions and get an error message if something went wrong. Not quite:

This is, believe it or not, more informative than what older versions of WordPress showed: the so-called White Screen of Death[4], which was literally a blank, white screen. Fortunately, getting a working site is as simple as disabling My Custom Functions:

  1. Access your server via FTP or SFTP.
    If you aren’t sure how, your web hosting provider will usually have instructions somewhere on their website.

  2. Browse to the directory wp-content/plugins/my-custom-functions-pro/.
    The location of the folder wp-content depends on your host’s setup. Typically, the folder public_html contains all the files of the website, among which you will find this folder. Please contact your web hosting company to get help if you can’t find this folder.

  3. Rename the file START to STOP. This will stop the execution of your custom code. Now your website should be returned to life and the WordPress Admin Area should be accessible.

I, uh, didn’t even know what web host my client was using. So I managed to break my client’s site with no way to fix it. In fairness to me, WordPress is unusually unforgiving. Still, I should have made sure I understood what I was doing before I started doing it.
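Because WordPress leaves so little room for error in custom code, one defensive option is to wrap a hook callback so that a runtime failure gets logged instead of taking the page down. What follows is a minimal sketch, not something from WordPress, MemberPress, or WP Discourse: `safely_run` and `broken_hook` are hypothetical names made up for illustration. It assumes PHP 7 or later, where most fatal runtime errors are thrown as `Error` objects.

```php
<?php
// Hypothetical helper (not part of WordPress, MemberPress, or WP Discourse):
// call a callback and log any runtime failure instead of letting it
// turn into a fatal error.
function safely_run( callable $callback, ...$args ) {
    try {
        return $callback( ...$args );
    } catch ( \Throwable $e ) {
        // \Throwable covers both Exception and Error (PHP 7+).
        error_log( 'Custom hook failed: ' . $e->getMessage() );
        return null;
    }
}

// Stand-in callback that fails the way a buggy hook might.
function broken_hook( $event ) {
    // Calling a function that does not exist throws an Error at runtime.
    return undefined_helper( $event );
}

// The failure is logged and null comes back; the script keeps running.
$result = safely_run( 'broken_hook', 'fake-event' );
var_dump( $result );
```

A wrapper like this would only help with errors raised while the callback runs. A syntax error stops the custom code from loading at all, so it is no substitute for testing somewhere other than production.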

There was a temptation to pretend I didn’t know what had happened. My story could have been, “I was poking around the site and suddenly it stopped working.” That’d be true, as far as it goes. What I said instead was:

I’m getting an error trying to log into https://www.example.com/wp-admin/index.php. I was working on a hook to remove subscriptions, so it might be my fault. We might need to use FTP to revert my change.

That’s still not 100% honest. I really did know it was my fault. I’d inserted code that somehow caused the site to stop working, and to imply it might be anything else was a face-saving lie. An hour and a half later, my client saw the message and restored from backup. I can’t imagine he was happy with me, but we got into a problem-solving mode and got it fixed.

Then we worked out a process for me to test code on our staging system and validate it with another developer before putting it in production. We agreed that I would schedule production changes for off-hours. I also got SFTP access so that I can quickly fix the problem myself. All of these are things I should have done in the first place, but better late than never.
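A "test before production" step can start with something as small as a syntax check on the snippet. The sketch below assumes PHP 7 or later with the tokenizer extension enabled; `snippet_parses` is a hypothetical helper name, not part of any plugin mentioned here. It relies on `token_get_all()` throwing a `ParseError` when called with the `TOKEN_PARSE` flag on invalid code.

```php
<?php
// Hypothetical pre-flight check: returns true when the snippet parses as PHP.
function snippet_parses( string $code ): bool {
    try {
        // With TOKEN_PARSE, the tokenizer throws ParseError on invalid syntax.
        token_get_all( $code, TOKEN_PARSE );
        return true;
    } catch ( \ParseError $e ) {
        return false;
    }
}

// One well-formed snippet and one with a missing function name.
var_dump( snippet_parses( '<?php echo "ok";' ) );
var_dump( snippet_parses( '<?php function {' ) );
```

This only catches syntax errors. A snippet that parses can still fail at runtime, for example by calling a function that does not exist, which is why the staging test and the second pair of eyes still matter.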

It reminded me of “The Time I Stole $10,000 from Bell Labs” by Thomas A. Limoncelli:

Learning from incidents does not magically happen. The desire may exist, but more is required. The shift from blame to learning requires a commitment from executives, management, and non-management alike. Executives must model blameless behavior and encourage learning. Management must create processes that enable learning. Project managers need to allocate space and time for these processes to happen. Everyone must learn to be more open and humble.

Postscript

A few days later I got a message from the client that the site was broken again. I hadn’t been doing anything at the time (I got the message just after I woke up in the morning) so I had no reason to suspect it was my fault. Thankfully it was a Discourse problem caused by an outdated Theme Component. I was able to get the site functioning correctly within a few minutes. Then I spent a few more minutes diagnosing the problem. By the end of the day I’d worked with the author of the broken component to get a fix to my client, returning the site to the desired state.

Presumably I won’t usually get an immediate redemption when I screw up in the future. It is nice to know, however, that my expertise provides real value to customers.


  1. Single Sign-On ↩︎

  2. Technically I documented an existing option in Discourse so that users can find it. :wink: ↩︎

  3. It’s kinda the whole point of sending an email. ↩︎

  4. Joining the Xbox Red Ring of Death and Windows’ Blue Screen of Death. It’s not a great sign when your error message gets a nickname. ↩︎