Wednesday, March 5, 2014

phpBB Scraper

Just as an FYI, I have a phpBB scraper ready to go and have tested it against cassieiswatching.myfreeforum.org, which I now have an XML dump of, should something ever happen to it ;) I'll work on making that XML dump usable (and searchable) in an effort to gather as much CiW-related information as possible in one place. Once it's usable, I'll make a download available as well. If someone wants the raw files in the meantime, just ask :)

9 comments:

  1. Hey!

    Any chance you'd share it with me?

    Thanks

  2. The scraper, or the raw files? I may have replaced both, so let me know, and I'll see what I can locate :)

  3. Hey! The scraper if possible :)

    Thanks!

  4. Sure... I'll see if I can locate it. It's someone else's code, and I don't even remember if I had tweaked it at all.

  5. I found the code :) Rather than paste it in here: I originally found it at https://github.com/indigolemon/phpBB-Scraper. I didn't modify it, other than to set the username, password, and URL of the site I was scraping.

  6. Great, thanks!

    I did find it yesterday and used it, but after around 4.5 hours it was still running (and whilst the destination folder was correctly created on my machine, it was empty). I guess it takes time to scrape a whole forum!

    Have a nice day

  7. I'd suggest setting $start_topic and $end_topic about 10 apart (making sure at least one valid topic number falls between the two values) and doing a test run, just to make sure it completes without error. Then crank $end_topic up to a value high enough to capture everything.

  8. That's a good idea, thanks. I've launched the script on my server (which is in a datacenter, so it's running 24/7 anyway) and still no luck after ~15 hours. The phpBB forum has more than 8000 topics, so I'm not sure whether that's normal.

    It's not very urgent so I can wait...

  9. I don't really know how long it would take, as I've never benchmarked it. I know I've run it against a small forum, on a fairly fast server, but I honestly couldn't tell you how long it took. I do seem to recall that it worked, though, as I have a capture of that forum around somewhere.

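For anyone who wants to try the "small range first" idea from comment 7 without touching the real scraper, here is a rough Python sketch of the approach. It is not the code from the phpBB-Scraper repository (that one is PHP); only the names start_topic and end_topic come from the comments above. The functions scrape_range, estimate_full_run_seconds, and fake_fetch are made up for illustration, and treating the range as inclusive on both ends is an assumption. The second function just scales a timed trial up linearly, which gives a ballpark figure at best.

```python
from typing import Callable, Dict, Optional


def scrape_range(
    fetch_topic: Callable[[int], Optional[str]],
    start_topic: int,
    end_topic: int,
) -> Dict[int, str]:
    """Fetch every topic ID from start_topic to end_topic (inclusive, by assumption).

    fetch_topic is a hypothetical callable that returns the topic's content,
    or None when no topic exists under that ID.
    """
    results: Dict[int, str] = {}
    for topic_id in range(start_topic, end_topic + 1):
        page = fetch_topic(topic_id)
        if page is not None:
            results[topic_id] = page
    return results


def estimate_full_run_seconds(trial_seconds: float, trial_ids: int, total_ids: int) -> float:
    """Scale the duration of a trial run linearly to the full ID range."""
    return trial_seconds * total_ids / trial_ids


def fake_fetch(topic_id: int) -> Optional[str]:
    """Stand-in for a real HTTP request; pretends only topics 3, 7 and 8 exist."""
    existing = {3, 7, 8}
    return f"<topic {topic_id}>" if topic_id in existing else None


if __name__ == "__main__":
    # Trial run: a window of 10 IDs that contains at least one valid topic.
    trial = scrape_range(fake_fetch, 1, 10)
    if not trial:
        print("Nothing captured: widen the range or check the login/URL settings.")
    else:
        print(f"Trial captured {len(trial)} topic(s): {sorted(trial)}")
        # Illustrative numbers only: a 9-second trial over 10 IDs, 8000 IDs total.
        print(f"Estimated full run: {estimate_full_run_seconds(9.0, 10, 8000):.0f} s")
```

If the trial comes back empty, that points to a problem with the settings rather than the forum's size, which is cheaper to find out after ten topics than after fifteen hours.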