Just as an FYI, I have a phpBB scraper ready to go, and have tested it with cassieiswatching.myfreeforum.org, which I now have an XML dump of, should something ever happen to it ;) I'll work on making that XML dump usable (and searchable) in an effort to gather as much CiW-related information as possible in one place. Once it's usable, I'll make a download available as well. If someone wants the raw files in the meantime, just ask :)
Hey!
Any chance you'd share it with me?
Thanks
The scraper, or the raw files? I may have misplaced both, so let me know, and I'll see what I can locate :)
Hey! The scraper if possible :)
Thanks!
Sure... I'll see if I can locate it. It's someone else's code, and I don't even remember if I had tweaked it at all.
I found the code :) Rather than paste it in here: I originally found it at https://github.com/indigolemon/phpBB-Scraper. I didn't modify it, other than to set the username, password, and URL of the site I was scraping.
Great, thanks!
I did find it yesterday and used it, but after around 4.5 hours it was still running (and whilst the destination folder was rightly created on my machine, it was empty). I guess it's taking time to scrape a whole forum!
Have a nice day
I'd suggest setting $start_topic and $end_topic about 10 apart (making sure at least one valid topic # falls between the two values) and doing a test run, just to make sure it completes without error. Then crank up $end_topic to a value high enough to capture everything.
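To illustrate the test-run idea, here is a minimal Python sketch. It is not the scraper's own code (that is PHP); `trial_scrape`, `fetch_topic`, and `fake_forum` are hypothetical names, and treating the range as inclusive at both ends is an assumption. Only the `start_topic` / `end_topic` names mirror the settings mentioned above.

```python
from typing import Callable, List, Optional


def trial_scrape(
    start_topic: int,
    end_topic: int,
    fetch_topic: Callable[[int], Optional[str]],
) -> List[int]:
    """Try every topic ID in [start_topic, end_topic]; return the IDs that exist."""
    found = []
    for topic_id in range(start_topic, end_topic + 1):
        # Hypothetical fetcher: returns None when the topic does not exist.
        page = fetch_topic(topic_id)
        if page is not None:
            found.append(topic_id)
    if not found:
        # An empty result says nothing about whether the scraper works,
        # so fail loudly and let the caller pick a different range.
        raise ValueError(
            f"no valid topic between {start_topic} and {end_topic}; pick another range"
        )
    return found


# Stand-in for a real forum: only topics 3 and 7 exist.
fake_forum = {3: "<topic 3>", 7: "<topic 7>"}
print(trial_scrape(1, 10, fake_forum.get))  # prints [3, 7]
```

If a small range like this completes and produces output, widening the range is the next step; if it raises or hangs, the problem is probably in the setup rather than in the size of the forum.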
That's a good idea, thanks. I've launched the script on my server (which is in a datacenter so it's running 24/7 anyway) and still no luck after ~15 hours. The phpBB forum has more than 8000 topics so I'm not sure if it's normal or not.
It's not very urgent so I can wait...
I don't really know how long it would take, as I've never benchmarked it. I know I've run it against a small forum, on a fairly fast server, but I honestly couldn't tell you how long it took. I do seem to recall that it worked, though, as I have a capture of that forum around somewhere.
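For a rough idea of the total running time, one option is to time a short test run and scale it up. The sketch below assumes scraping time grows roughly linearly with the number of topics, which may not hold in practice. The function name and the sample figures (10 topics in 30 seconds) are made up for illustration; only the 8000-topic count comes from the comment above.

```python
def estimate_total_seconds(
    sample_seconds: float, sample_topics: int, total_topics: int
) -> float:
    """Linearly extrapolate a timed sample run to the whole forum."""
    if sample_topics <= 0:
        raise ValueError("sample_topics must be positive")
    return sample_seconds / sample_topics * total_topics


# Hypothetical sample: 10 topics scraped in 30 seconds, forum of 8000 topics.
seconds = estimate_total_seconds(30.0, 10, 8000)
print(f"{seconds / 3600:.1f} hours")  # prints 6.7 hours
```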