2021-03-24 19:31:38

In light of concerns regarding the forum's pending doom, I've decided to release the script I use to archive the game database. You can find it on my Github.

The last time I tried, it took somewhere short of an hour and produced a 3MB dump. This could be a hell of a lot quicker, but we're fetching many small pages and trying to be kind to the server. Unlike my actual forum backup, this is small enough to be comfortably manageable with JSON.
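
For anyone curious what "kind to the server" can look like in practice, here is a minimal sketch, not the actual script. The names `archive_pages` and `save_dump`, the injected `fetch` callable, and the one-second default delay are all assumptions made for illustration.

```python
import json
import time
from typing import Callable, Dict, Iterable


def archive_pages(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Fetch each URL one at a time, pausing between requests.

    `fetch` is whatever function downloads a page and returns its text;
    it is passed in so the pacing logic stays independent of the HTTP client.
    """
    dump: Dict[str, str] = {}
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first.
            sleep(delay)
        dump[url] = fetch(url)
    return dump


def save_dump(dump: Dict[str, str], path: str) -> None:
    """Write the collected pages to a single JSON file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(dump, handle, ensure_ascii=False, indent=2)
```

Fetching sequentially with a pause is the slow part, which is the trade-off described above: the run takes longer, but the server never sees a burst of requests.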

In conjunction with the one Chris released, which operates on the forum, I'm hoping we can satisfy data hoarders and, more importantly, ensure that all the hard work on the part of our amazing editors is preserved for years to come.

I also figured this wouldn't be a bad time for a little update. We are aware of the errors some of you have been getting when trying to visit and interact with the forum, and aside from some personal speculation, we are just as clueless as you are. I'm doing what I can with what I have to chase it down.

In the meantime, rest assured that I've been running a pretty scalable off-site backup solution, automatically grabbing everything for a couple of months now. As such, this isn't a cry for help. The script is more for fun than necessity.

Oh how I wish I could just copy the database, sit back and have a beer. But alas, no server access means the actual process has involved more code than I like to think about toward a solution that will never end up quite as thorough as it could be. If the site were ever to go down, I'd need about a full day to develop a minimalistic browser and bring it online, at which point I'm reasonably sure the hosts will have said hello. In short, we shouldn't ever be losing more than a month or so of content, which would totally suck but have nothing on our sixteen years and counting.

2021-03-24 19:44:00

@cartertemm
Excellent work.  You did what you could with what you had, and it's much appreciated.

2021-03-24 20:03:04

Thanks for your work. You are absolutely right: even if we lose a month or two, it's not a big deal. It has already happened once in the past. What matters is that the majority of the content remains.

2021-03-25 15:56:08

Imho such off-site backups should be regularly scheduled so as not to lose much if/when the site goes boom. Is there any way to bypass Cloudflare in order to get direct access to the server hosting the forum so as not to annoy Cloudflare with excessive scraping? Also, do these scripts at least attempt to grab only what is new or changed since the last scrape?

2021-03-25 16:47:10

Thankfully the only part of the site sitting behind CloudFlare right now appears to be the introductions room, so I'm willing to assume that a couple topics are OK.

There's caching here, but it's limited in the sense that we still have to fetch every game on every run. I do it this way because I expect existing entries to change, especially given the recent crackdown on unauthorized assets.

As for forum downloaders, mine and, I'm pretty sure, Chris's don't even touch the topic if the post count is the same. Edits could happen, I suppose, but I'm not willing to trade four hours and sixteen years for a couple of spelling errors or clarifications. You indirectly raise an interesting point, though: I probably need to be treating sticky posts differently. Appreciate the food for thought.
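
The post-count check can be sketched roughly like this. It is an illustration rather than code from either downloader: `topics_to_refetch`, the shape of the two dictionaries (topic ID mapped to post count), and the `always_refetch` argument for sticky topics are all hypothetical.

```python
from typing import Dict, Iterable, List


def topics_to_refetch(
    previous_counts: Dict[str, int],
    current_counts: Dict[str, int],
    always_refetch: Iterable[str] = (),
) -> List[str]:
    """Return the topic IDs worth downloading again.

    A topic is selected when it is new, when its post count differs from
    the last run, or when it is listed in `always_refetch` (for example
    sticky topics, which tend to be edited in place).
    """
    forced = set(always_refetch)
    return [
        topic_id
        for topic_id, count in current_counts.items()
        if topic_id in forced or previous_counts.get(topic_id) != count
    ]
```

As noted above, this misses edits that leave the post count unchanged, which is the price of not re-downloading everything.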

2021-03-25 17:12:54

@5, the entire site is behind CF now. If you like, I can try to get you the real IP addresses.

"On two occasions I have been asked [by members of Parliament!]: 'Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out ?' I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question."    — Charles Babbage.
My Github

2021-03-25 17:50:05

You're right, good catch. I see the CF headers now and will work on updating my scripts.
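
For reference, a quick way to spot those headers might look like the sketch below. It assumes, as far as I know, that Cloudflare-proxied responses usually carry a `CF-RAY` header and a `Server` value of `cloudflare`; the function name `looks_like_cloudflare` is made up for this example, and a missing header does not prove the site is unproxied.

```python
from typing import Mapping


def looks_like_cloudflare(headers: Mapping[str, str]) -> bool:
    """Guess whether a response passed through Cloudflare from its headers."""
    # Header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value for name, value in headers.items()}
    has_ray_id = "cf-ray" in lowered
    server_is_cf = lowered.get("server", "").strip().lower() == "cloudflare"
    return has_ray_id or server_is_cf
```

The response headers from any HTTP client can be passed in as a plain dictionary.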

Unfortunately, the techniques I ordinarily use to get CloudFlare powered IPs aren't proving very effective right now, but I'll continue chipping at it later today. Kind of ironic that a site admin has to resort to pentesting to get the address of the box, but hey, nobody asked me.

2021-03-25 19:03:40

@cartertemm, I've sent you a PM possibly containing useful information.

2021-03-25 19:08:30

And I just sent you another containing an update.