Thoughts On The SQLSaturday Upgrade & Security Issue

Last week PASS deployed an update to the SQLSaturday web site and then had to take the site down for almost four days to fix security issues. You can read about the features and the issues in these posts from PASS:

I love – truly – that PASS has already posted about what happened. That’s an important step for the org, and one that makes me reluctant to post anything at all about the incident because I don’t want to discourage transparency. I hope the trend of that kind of communication continues.

As for the upgrade, it seems to go further than just a new coat of paint while still being an evolutionary change, and I consider that good – better to do incremental change where possible, because doing otherwise can really frustrate users who are used to things being in a certain place. I imagine some of us will value some changes more than others, but I didn’t see anything that felt like it was a step backwards. Many seemed to address common problems from speakers and event organizers – good! I wish there were more data made public via API and/or the guidebook xml file. I’ve heard that with the new site scraping is harder than before, and really, should we make people scrape? That would be an area where some minor add-on work could yield some nice benefits.

On the private administrative site used by organizers there was work done also, though I’d like to see more. Some of the reports that I wrote back in 2009 are still there and need review/sprucing up, and the dashboard is still not as functional as I’d hoped. For example, it doesn’t have the one marketing chart I value the most, the one that shows cumulative registrations by week (the chart is there, but it shows the number that registered each week, not the cumulative total). That’s a minor quibble, but one I hope will get addressed in the next update.
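To make the difference between the two charts concrete, here is a minimal sketch of turning per-week counts into the cumulative series. The numbers and the function name are made up for illustration; this is not how the PASS dashboard is built, just the arithmetic behind the chart I’d like to see.

```python
from itertools import accumulate

# Hypothetical weekly registration counts (illustrative only, not real event data).
weekly_registrations = [12, 30, 45, 20, 60]


def cumulative_by_week(weekly):
    """Return running totals: element i is the sum of weeks 0 through i."""
    return list(accumulate(weekly))


cumulative = cumulative_by_week(weekly_registrations)

# Print both series side by side: what the current chart shows vs. the cumulative view.
for week, (new, total) in enumerate(zip(weekly_registrations, cumulative), start=1):
    print(f"week {week}: {new} new, {total} total")
```

The cumulative line only ever goes up, which makes it easy to compare this year’s pace against a prior event at the same number of weeks out.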

There was a decent amount of messaging about the planned outage for the upgrade. I’d have liked to have seen more messaging either before or after about the details. I think details are both reassuring and interesting. The why and how are perhaps more interesting to us than they would be to a homeowners association. Details would also help us weigh the duration and investment against the return on that investment. The PASS IT team did a lot of work, and it’s always good to tell the story and celebrate the work.

On to the rollout.

I was surprised about taking an entire weekend for the outage, but took it to be a hyper conservative approach which I couldn’t fault. I had just started looking at the new site when the security issue was mentioned on Twitter and then the site was offline.

It’s never good to have a security issue, but it was handled as well as I’ve seen PASS handle any issue – an almost immediate response, and the site was updated with a brief message about the outage. Kudos on that.

The messaging after that was ok, but felt minimalist. I’ve been through these, and probably you have too: you’re trying really, really hard to fix whatever is broken and get things going, so messaging gets pushed off, delayed, whatever. It’s important to have someone (that is, one person) public facing who can answer questions, even if the answer is “don’t know”. I’d like to have seen the SQLSat web site updated with more news, or links to blog/Twitter posts, so that attendees or potential attendees had more info to work with – they aren’t all on Twitter.

It seemed to end up being an all-or-nothing deployment, not my favorite kind. Ideally (and I don’t know if this was possible) the sites would run in parallel for some period of time, with the “old” site as primary and the new site available to anyone who wanted to try it.
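As a rough sketch of what I mean by running in parallel, the routing decision can be as small as the function below. Everything here is hypothetical (the labels, the opt-in flag, the kill switch); I don’t know how the PASS sites are hosted or whether this was feasible for them.

```python
# Hypothetical labels for the two deployments; the real hosting setup is not known to me.
OLD_SITE = "old"
NEW_SITE = "new"


def choose_site(opted_in_to_new: bool, new_site_enabled: bool = True) -> str:
    """Pick which deployment serves a visitor.

    The old site stays the default. Only visitors who explicitly opt in
    see the new one, and switching new_site_enabled off sends everyone
    back to the old site without taking anything offline.
    """
    if new_site_enabled and opted_in_to_new:
        return NEW_SITE
    return OLD_SITE
```

With something like this in place, a problem found on the new site means flipping one switch while the old site keeps serving everyone.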

On the security flaws, stuff happens. I’d be curious to know if PASS is using vulnerability scans and code analysis to supplement testers, but even then, it can be hard to catch issues. The issue was found and action was taken; that’s the good part.

The bad part is obvious: a four-day outage on an important site. I’ll argue that having any site down for four days is just not acceptable. To be clear, I’m not suggesting that leaving it up with the security flaw was acceptable either, but that could have been addressed in other ways (killing the affected page, for example). My point is not to second-guess decisions, it’s to look at the results. Going into this, no one would have said that four days was acceptable, and it’s still not.

The site is up, and I’m sure post-deployment tasks are being done and lessons learned are being compiled and internalized. Now I’d like to see PASS go forth and survey users, get good feedback, and publish a backlog and list of ideas for discussion for the next version.