Concurrent e-mail fetching in Charm

Manfred Stienstra

The fine people at Charm asked us to solve a very specific technical problem for them. Charm relies on fast and reliable e-mail fetching so they needed a fetcher which could meet these requirements.

Unfortunately, Charm is no longer actively developed, but we hope you can still appreciate some insights into parts of its infrastructure.

Speeding things up

Charm's initial implementation was kept simple. E-mail fetching happened serially in one process. It looped through a list of IMAP mail accounts, fetched messages, marked them as seen, and disconnected.

This worked fine with a limited number of accounts on well-behaved servers. When the number of accounts began to grow, this approach ran into a number of bottlenecks. Speed is an important feature in a support environment, so this needed to be sorted out.

Some IMAP servers appeared to use connection queues for incoming connections and long timeouts when disconnecting. Fetching content was sometimes very slow, down to a few bytes per second, possibly because of rate limiting.

E-mail fetching is a classic example in most concurrency literature and is easily solved with parallel batch processing. The accounts in Charm provide us with a very natural entity for partitioning the workload.

Where things get technical

We developed a little fetcher daemon named Fido, a fully concurrent e-mail fetcher backed by EventMachine. Fido maintains a queue of all configured IMAP accounts. The main loop hands out these mail accounts to a configurable number of workers. When Fido finishes the queue it adds all accounts back and starts over.
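The queue behaviour described above can be sketched in a few lines of plain Ruby. This is only an illustration under our own assumptions: the class name `AccountQueue` and the method `next_account` are hypothetical and not taken from Fido's code, and the EventMachine wiring is left out.

```ruby
# Hypothetical sketch of the account queue; names are illustrative only.
class AccountQueue
  def initialize(accounts)
    @accounts = accounts.dup # every configured account
    @pending  = accounts.dup # accounts still to be fetched in this round
  end

  # Returns the next account to hand to an idle worker. When the current
  # round is finished, all accounts are added back and a new round starts.
  def next_account
    return nil if @accounts.empty?

    @pending = @accounts.dup if @pending.empty?
    @pending.shift
  end
end

queue = AccountQueue.new(%w[alice@example.com bob@example.com])

# Three consecutive hand-outs; the third one starts a new round.
handed_out = Array.new(3) { queue.next_account }
puts handed_out.inspect
```

A real daemon would also have to avoid handing an account to a second worker while the first is still busy with it; the sketch ignores that.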

The workers usually spend a long time waiting for something to happen on their connection. Fetching is sped up by spawning a large number of workers.

If every worker were a full Ruby process, memory usage would quickly get out of hand. EventMachine allowed us to keep a lot of connections open from just one process. Now the number of workers is mostly limited by the number of available outgoing ports on the server.

Comparing memory usage for Fido (blue) and process based fetchers (green). The bars show 1, 4, 8, and 16 workers. Shorter is less memory.

The web application controls Fido by writing the e-mail accounts to an encrypted configuration file. Fido watches this configuration file for changes and automatically reacts when new mail accounts are added or removed. It uses either kqueue or epoll depending on the operating system.

When new accounts are added to the configuration file they are bumped to the start of the processing queue. Fido can report a subset of configuration and connection errors back to the web application through a REST API. This allows us to notify the Charm user immediately when a configuration change doesn't work.
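As a rough illustration of the reload step, the sketch below reconciles a processing queue with a freshly loaded account list. The helper name `reconcile_queue` and the use of plain arrays are assumptions made for this example; file watching and decryption are left out.

```ruby
# Hypothetical helper: reconciles the processing queue with a reloaded
# account list. Names are illustrative, not Fido's real API.
def reconcile_queue(queue, known, reloaded)
  added   = reloaded - known # accounts that are new in the configuration
  removed = known - reloaded # accounts that were deleted from it

  # Drop removed accounts, then put new accounts at the front of the queue
  # so they are fetched first.
  added + (queue - removed - added)
end

queue    = %w[bob@example.com carol@example.com]
known    = %w[alice@example.com bob@example.com carol@example.com]
reloaded = %w[alice@example.com carol@example.com dave@example.com]

new_queue = reconcile_queue(queue, known, reloaded)
puts new_queue.inspect
```

In this example bob is dropped, dave jumps to the front, and alice, who was already fetched in the current round, stays out of the queue until the next round.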

Both configuration reloading and error notification are performed concurrently with the workers.

Fido stores the fetched e-mails in a Redis database. This database also serves as a processing queue for the e-mail importer.

Decentralized

The only centralized component in Fido's architecture is the main web application. The web application will only be contacted in case of failure and Fido can happily stay running when the web application is down.

Fido can scale across multiple servers by splitting up the mail accounts file.
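One simple way to split the accounts is round-robin, so every server gets a slice of roughly equal size. The sketch below assumes the accounts are available as a plain list; `split_accounts` is a hypothetical helper, and other strategies, for example grouping by IMAP host, would work just as well.

```ruby
# Hypothetical helper: splits the account list into one slice per server,
# dealing accounts out round-robin.
def split_accounts(accounts, server_count)
  raise ArgumentError, "server_count must be positive" unless server_count.positive?

  slices = Array.new(server_count) { [] }
  accounts.each_with_index do |account, index|
    slices[index % server_count] << account
  end
  slices
end

slices = split_accounts(%w[a@example.com b@example.com c@example.com], 2)
slices.each_with_index do |slice, index|
  puts "server #{index}: #{slice.join(', ')}"
end
```

Each slice would then be written to the accounts file of one Fido instance.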

Fully tested

Fido has an elaborate test suite. All the classes in the application are unit tested. The connections between the various components are tested through a set of functional tests. A mock IMAP server specifically written for the test suite allows us to test various error conditions like disconnects and protocol errors. Finally, a full-stack integration test is performed against an actual IMAP server running locally.

In the future we would love to get CI running the integration tests against all supported IMAP servers, possibly even against a number of different versions of each.

Low maintenance deployment

It's easy to get 99% availability for your product, but we're aiming for as many nines as we can get. A possible source of downtime is the need for human intervention. We've tried to eliminate this and make it easy where necessary.

Fido has a “dry run” mode which allows it to run without actually changing anything in inboxes and databases. This is ideal for testing dependencies and trying out new configurations.

Developers can run Fido locally during development without any need for configuration. When Fido is started for the first time on a clean system it will attempt to write its own configuration file with defaults. This configuration file is mostly self-explanatory so sysadmins can easily get it running.

Fido looks in the system configuration directory (/etc), web application deploy directories, and in the home directory of its process owner for its own configuration file. This makes it easy to deploy to just about any type of server without having to touch any code.
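A lookup like this can be sketched as a list of candidate paths that is checked in order. The file name `fido.yml`, the helper names, and the search order shown here are assumptions for the sake of the example; the real names and precedence may differ.

```ruby
# Hypothetical lookup. The file name "fido.yml" and the order of the
# directories are assumptions, not taken from Fido itself.
def default_candidates(deploy_dirs: [], home: Dir.home)
  ["/etc", *deploy_dirs, home].map { |dir| File.join(dir, "fido.yml") }
end

# Returns the first candidate that exists as a regular file, or nil.
def find_config(candidates)
  candidates.find { |path| File.file?(path) }
end
```

Calling `find_config(default_candidates(deploy_dirs: ["/var/www/app/shared/config"]))` would return the first existing file, or `nil` on a clean system, at which point a daemon could fall back to writing its defaults. The deploy directory in that call is only a placeholder.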

The daemon was set up to be as resilient as possible. It will fail gracefully when the main web application goes down and it automatically reconnects to Redis when necessary.
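Automatic reconnection usually comes down to retrying a failed operation a limited number of times. The helper below is a generic sketch of that idea, not Fido's actual reconnect logic: the name `with_reconnect`, its parameters, and the error classes are all assumptions. It uses a blocking `sleep`, which an evented daemon would replace with a timer so the reactor keeps running.

```ruby
# Hypothetical retry helper. Runs the block and retries it when one of the
# given errors is raised, up to `attempts` times in total.
def with_reconnect(attempts: 3, delay: 0.5, errors: [IOError, SystemCallError])
  tries = 0
  begin
    yield
  rescue *errors
    tries += 1
    raise if tries >= attempts # give up and let the caller handle the failure

    sleep(delay)
    retry
  end
end

# Simulated connection that fails twice before it succeeds.
calls = 0
result = with_reconnect(delay: 0) do
  calls += 1
  raise Errno::ECONNREFUSED if calls < 3

  :connected
end
puts "#{result} after #{calls} attempts"
```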

Fido reports the most important connection problems along with their debug information to the main web application so site maintainers can easily keep an eye on them. For more elaborate troubleshooting an administrator can drop down to the log file.