For some time now I’ve been crawling the web and doing scraping at scale for a stealth project of mine, but I’ve not really been using a headless browser solution because it’s overkill for about 70% of the sites I scrape. However, I do need to scrape the other 30%, so I recently added the capability to scrape AJAX pages using phantomjs and poltergeist.
The big problem I’ve run into is that poltergeist leaks file descriptors, and if you just let it go, you’ll get a “Too many open files” error after a few hours. I won’t get into the details (mainly because I don’t recall them… I fixed this a few weeks ago), but the problem is jruby-specific.
At any rate, I monkey patched poltergeist so that it doesn’t try to redirect phantomjs’s console output to either STDOUT or a user-provided IO object. That STDOUT redirection was the root of a lot of pain on jruby, and since I don’t need the console output I just ripped out the offending code.
I’ve posted a gist with the very simple monkey patch, which you can include in /config/initializers/poltergeist.rb. This has been working in production without a problem for a few weeks under jruby 1.7.11 + rails 4.0.2. I crawl about 230 sites simultaneously right now with this setup, though only about 30 of them are using phantomjs.
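To give a sense of the shape of such a patch (this is a sketch, not the gist itself): you reopen poltergeist’s client class and no-op the console redirection. The redirect_stdout method name below is an assumption based on the poltergeist versions of that era — check your version’s client.rb before relying on it.

```ruby
# config/initializers/poltergeist.rb
#
# Sketch of a monkey patch that skips phantomjs console redirection.
# The class and method names are assumptions -- verify them against
# the poltergeist version you actually run.
module Capybara
  module Poltergeist
    class Client
      # No-op the STDOUT redirection that caused fd leaks on jruby;
      # just run the block (process spawn, etc.) without touching STDOUT.
      def redirect_stdout(*)
        yield if block_given?
      end
    end
  end
end
```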
On a related note, a general word of caution to anyone who wants to do what I’m doing, i.e. web crawling at scale with jruby + sidekiq + rails on ec2: you absolutely must call driver.quit on your Capybara::Session object every time you’re completely finished with it, or else you will end up with zombie processes and you’ll run out of memory.
What I do is initialize a new session object at the start of every scraper worker, and when that worker finishes, its cleanup step closes all http connections and quits the session object. Put this cleanup code in an ensure block in your sidekiq worker’s perform method, so that it runs even when the worker crashes.
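A minimal sketch of that ensure-block pattern, with the Capybara and sidekiq pieces stubbed out so the structure is clear — ScraperWorker, FakeSession, and close_http_connections are illustrative names, not real API. In a real worker, @session would be a Capybara::Session and quitting it would be @session.driver.quit.

```ruby
# Stand-in for a Capybara::Session; records whether quit was called.
class FakeSession
  attr_reader :quit_called

  def quit
    @quit_called = true
  end
end

class ScraperWorker
  # include Sidekiq::Worker   # in the real app

  attr_reader :session

  def perform(url)
    @session = FakeSession.new
    scrape(url)                 # may raise mid-scrape
  ensure
    # Runs whether scrape succeeded or crashed, so the browser process
    # behind the session never lingers as a zombie.
    close_http_connections
    @session.quit if @session
  end

  private

  def scrape(url)
    # Simulate a worker crash to show the ensure block still fires.
    raise "simulated crash while scraping #{url}"
  end

  def close_http_connections
    # placeholder: tear down persistent http connections here
  end
end
```

Even though scrape raises here, the ensure block still quits the session before the exception propagates up to sidekiq’s retry machinery.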