Getting away with data – Migrating to Kolab Now – Mail Part #1 (preparation)

By   2015-09-25

Before I begin, I’d like to recommend this series of articles. I borrowed many of the ideas described there, especially about mail migration.

In the last entry I stated that I had the following components to migrate from Gmail to Kolab:

  1. Mail;
  2. Address book;
  3. Calendar;
  4. Tasks;
  5. Files;
  6. Notes.

My migration plan consisted of lots of steps and several moving parts, but I broke it down into two “roll outs”:

  1. Mail and contacts;
  2. The rest.

I did this because at first I was only testing the service and went for the “lite” subscription, which saves you some money. It gives you a mailbox and an address book, so the most critical needs of most people are met. I can imagine lots of scenarios where such a combo is more than enough (for example, a network of distributed shops where the workers just need to exchange information with each other and invoices with a central warehouse in the region).

At some point I decided I needed my calendars to join the rest, so I upgraded, but I’m getting ahead of myself.

What is e-mail exactly?

While working on the first and hardest step, I had to dig into things I hadn’t fully understood before embarking on this journey. Starting with the basics: I wasn’t entirely sure what e-mail actually was.

In essence: an email message is a specially formatted piece of text that consists of an envelope, a header and a body. Multiple messages may be stored in one big text file (like in Mozilla Thunderbird) or in a maildir directory structure (like in Kontact, although Kontact gives you numerous other options).
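To make the two storage styles tangible, here is a minimal sketch using Python's standard `mailbox` module. The addresses, subjects and file names are made up for illustration, and the envelope (which mail servers exchange during delivery) is not modelled here; only the header and body are.

```python
import mailbox
import os
import tempfile
from email.message import EmailMessage


def make_message(subject: str) -> EmailMessage:
    """Build a tiny message: a few header fields plus a text body."""
    msg = EmailMessage()
    msg["From"] = "alice@example.org"   # hypothetical address
    msg["To"] = "bob@example.org"       # hypothetical address
    msg["Subject"] = subject
    msg.set_content("Lunch at noon?")
    return msg


workdir = tempfile.mkdtemp()

# mbox style: every message is appended to one big text file.
mbox_path = os.path.join(workdir, "inbox.mbox")
box = mailbox.mbox(mbox_path)
box.add(make_message("first"))
box.add(make_message("second"))
box.flush()
box.close()

# maildir style: a directory tree with one file per message.
maildir_path = os.path.join(workdir, "inbox.maildir")
md = mailbox.Maildir(maildir_path, create=True)
md.add(make_message("first"))
md.add(make_message("second"))
md.close()

mbox_is_single_file = os.path.isfile(mbox_path)
maildir_files = os.listdir(os.path.join(maildir_path, "new"))
subjects = sorted(m["Subject"] for m in mailbox.mbox(mbox_path))

print("mbox is one file:", mbox_is_single_file)
print("maildir message files:", len(maildir_files))
print("subjects read back from mbox:", subjects)
```

Opening the resulting `inbox.mbox` in a text editor shows both messages one after another, which is a quick way to convince yourself that mail really is "just text".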

So: all I have to do is to copy-paste some *.txt files and eat my noodles? Far from it!

  1. The mailboxes are likely to have different sizes (scaling up is not a concern; it’s scaling down where you need caution);
  2. Different providers probably run different mail servers, which means some capabilities will be available on one side and not on the other;
  3. The folder structures may also differ;
  4. Many messages have attachments: how large an attachment does the destination server allow?
  5. Does the destination provider use some kind of upload throttling (I’m looking at you, Microsoft)?
  6. The encoding may be different (pray it’s UTF-8 on both sides);
  7. And a thousand other things…

Thankfully there are tools that take that burden off you, so you can rely on trusted, widely tested solutions. One of them is imapsync by Gilles Lamiral. Gilles is a really nice guy who offers both paid and free support. 50€ for a support ticket? For any bigger company that’s practically free, and he’s really helpful. And immensely patient. Check out his page and the mailing lists.

What is imapsync, you may ask (apart from the obvious you’ve extracted from its self-descriptive name)? It’s a Perl script over 6,000 lines long. Yep, Perl: the all-in-one, it-slices, it-dices, it-even-makes-coffee scripting language whose strongest domain is text processing. And because email messages are text, Perl makes an excellent choice!

Yes, you can use other methods. You can even set up both mailboxes in Mozilla Thunderbird, drag and drop folders between them and wait for the sync, but with imapsync you get precious diagnostic data. Good luck finding that one critical e-mail with the only copy of important_stuff.pdf without logs. No, seriously: measure twice, cut once. Take your time and don’t get tempted by shortcuts; this will save you headaches in the long run. If working in IT has taught me anything, it is this:

Do it right the first time or it will bite you when least expected.
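In the measure-twice spirit, a first imapsync run can be a dry run that only reports what it would do. The sketch below just assembles such a command line in Python and prints it; it does not execute anything. The account names and password-file paths are hypothetical, and the Kolab Now host name is an assumption to verify against the provider's own documentation. This is not necessarily the exact invocation I used.

```python
import shlex
from typing import List


def imapsync_dry_run(user1: str, passfile1: str,
                     user2: str, passfile2: str,
                     host1: str = "imap.gmail.com",
                     host2: str = "imap.kolabnow.com") -> List[str]:
    """Assemble an imapsync command that changes nothing (--dry).

    host2 is an assumed value; check it before use.
    Passwords are read from files so they stay out of the shell history.
    """
    return [
        "imapsync",
        "--host1", host1, "--user1", user1, "--passfile1", passfile1,
        "--host2", host2, "--user2", user2, "--passfile2", passfile2,
        "--dry",
    ]


# Hypothetical accounts and file locations, for illustration only.
cmd = imapsync_dry_run(
    user1="ham@example.com", passfile1="/home/me/.secrets/gmail",
    user2="me@example.org", passfile2="/home/me/.secrets/kolab",
)
print(shlex.join(cmd))
```

Once the dry-run report looks sane, dropping `--dry` from the list turns the same command into the real transfer.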

I used getmail to create a backup of all my mail. Granted, imapsync won’t delete your mail from the source mailbox unless explicitly told to, but the rule of thumb states:

Always have a backup!

How much email is too much?

If you recall from the previous post, I had three mailboxes hosted by Google:

  1. A Gmail mailbox fed by a university mailbox, in addition to lots of direct messages [size: big; complexity: medium] [called: ham];
  2. A Gmail mailbox fed by 3 different legacy mailboxes I used to use; given as the registration address for lots of services, many of them defunct [size: medium; complexity: significant] [called: spam];
  3. A Gmail mailbox I used for my resumes, much more official, least used [size: small; complexity: small] [called: eggs].

The cute nicknames I gave them are arbitrary. I could argue with myself that while spam was getting lots of unwanted mail, it was ham that held most of the legacy things that needed to go forever. I doubt I was alone with those problems: mailboxes with overlapping purposes and dozens of ad hoc registrations are common sins of mail users worldwide.

Google lets you check how much of your quota is used by mail and other services. You should absolutely check this before starting the migration, otherwise you won’t know what capacity to order from Kolab Now. This works both ways: there is no point paying for space you won’t use, but you also want to be sure all the mail you need fits into the box. It’s up to you to decide, but I was pleasantly surprised: I have the smallest, 2GB mailbox and I’m not using even 50% of it. Before finishing the migration I was afraid I could run out of space.
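The same check can be scripted. IMAP servers that support the QUOTA extension report storage usage and limit in units of 1024 bytes, and Python's `imaplib` exposes this through `getquotaroot()`. The sketch below only parses such a response line and compares it with a target plan; the sample line and the 20% headroom are invented for illustration, not taken from my accounts.

```python
import re
from typing import Tuple

KIB_PER_GIB = 1024 * 1024


def parse_storage_quota(quota_line: bytes) -> Tuple[int, int]:
    """Extract (used, limit) in KiB from an IMAP QUOTA response line."""
    match = re.search(rb"STORAGE (\d+) (\d+)", quota_line)
    if match is None:
        raise ValueError("no STORAGE resource in quota response")
    used_kib, limit_kib = (int(group) for group in match.groups())
    return used_kib, limit_kib


def fits(used_kib: int, target_limit_kib: int, headroom: float = 0.2) -> bool:
    """True if the used space fits the target while leaving some headroom."""
    return used_kib <= target_limit_kib * (1 - headroom)


# Invented sample; a live value could come from something like
# imaplib.IMAP4_SSL(host).getquotaroot("INBOX") after logging in.
sample = b'"" (STORAGE 293601 15728640)'

used, limit = parse_storage_quota(sample)
target = 2 * KIB_PER_GIB  # a 2 GiB destination mailbox

print(f"used: {used / KIB_PER_GIB:.2f} GiB of {limit / KIB_PER_GIB:.0f} GiB")
print("fits into 2 GiB with headroom:", fits(used, target))
```

The headroom parameter is there because a mailbox that is full on day one of the new service is not much of a migration.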

Step #1 – cleanup time!

Be prepared for an often unpleasant emotional and mental exercise. Mailboxes are the contemporary graveyards of data. Many people don’t even know what archives are, let alone use them. And those who know often abuse them (I’m looking at myself now). Moreover: email is where many dreams come to die and where much sad news about other people or, worse, ourselves lies festering and decomposing.

  • Going into your past correspondence is a trip back in time.
  • Actually, it’s really funny how small and naive some “problems” seem in comparison.
  • Also: it’s interesting to trace how some people are referenced in the database: some contacts got their names changed depending on the dynamics of our relationships.
  • The topics of the conversations, and what exactly was said, can also be mind-boggling.
  • Not to mention the send-reply ratio, which with some people was just… “Why on Earth did I even bother wasting my time on you!?”
  • On the other hand you may find interesting and amusing things, like giantmicrobes.com.

Do yourself a favor: forgive yourself and move on. Delete stuff (and make yourself a magical snowman).

Important note: by default Gmail groups messages into threads, so when you delete a huge thread, you actually move many more messages to the bin than the single entry you clicked. That’s a detail worth keeping in mind, since imapsync counts messages only, not threads.

Sub-step: delete unneeded emails with attachments

The goal is to reduce both the number of messages and their weight. The key here is killing as many attachments as humanly possible.

I managed to go from over 500 emails with attachments to 162 (some of them were over 20MB). Subscriptions and ads especially like to bloat the mailbox. I once got a promotional email from a gym with a 2MiB image. Granted, the girl looked attractive (hey, it’s a gym after all), but:

  • Apart from the person in the image, there was lots of text that could easily have been in the message itself;
  • Why send a high-quality image if most people won’t view it at full size?
  • Bonus round: a great way to annoy mobile users, some of whom pay good money for data transfers.
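Hunting for heavy attachments can also be done offline, against the local backup. The following is a rough sketch that walks a set of messages and lists every attachment with its decoded size, biggest first. The function name and the demo messages are invented for illustration; it is not the procedure I followed in the Gmail web interface.

```python
from email.message import EmailMessage, Message
from typing import Iterable, List, Tuple


def list_attachments(messages: Iterable[Message]) -> List[Tuple[int, str, str]]:
    """Return (size_in_bytes, filename, subject) tuples, largest first."""
    found = []
    for msg in messages:
        for part in msg.walk():
            filename = part.get_filename()
            if filename is None:
                continue  # not an attachment, e.g. the text body
            payload = part.get_payload(decode=True) or b""
            found.append((len(payload), filename, str(msg.get("Subject", ""))))
    return sorted(found, reverse=True)


# Demo data: one promotional mail with a poster, one plain note.
promo = EmailMessage()
promo["Subject"] = "Join our gym"
promo.set_content("See the attached poster.")
promo.add_attachment(b"\0" * 2048, maintype="image", subtype="jpeg",
                     filename="poster.jpg")

note = EmailMessage()
note["Subject"] = "Hi"
note.set_content("No attachment here.")

report = list_attachments([promo, note])
for size, filename, subject in report:
    print(f"{size:>10} bytes  {filename}  ({subject})")
```

Assuming the backup was delivered into a maildir, passing `mailbox.Maildir(path, create=False)` instead of the demo list should give the same kind of report for real mail.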

Sub-step: download the needed attachments

Not all files are garbage; you may still need some of them later. But do yourself a favor: don’t keep them in the mailbox. A mailbox is not an FTP server, nor a shared drive. Don’t be lazy: you may need that space in the future for some critical things. Use this occasion to download the attachments and back them up locally. You can also back them up to a cloud share (like Kolab).

Good to know when downloading attachments:

  • When you download a single file, it gets downloaded as-is;
  • If you download a group of files (2+), they get packed into a *.zip file which bears the name of the conversation;
    • BUT all risky characters such as “:” and “/” get stripped from the file name, and so do Polish characters (e.g. “różne” -> “rne”);
  • If there is no topic, the package will just be called “Gmail”.
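If you want to predict what an archive will be called (handy when matching downloads back to conversations), the naming behavior observed above can be approximated in a few lines. This is only a guess that reproduces the examples in this post; it is not Gmail's actual algorithm, and the set of "risky" characters is an assumption.

```python
def approximate_archive_name(subject: str, risky: str = ":/") -> str:
    """Approximate the observed naming of a multi-file attachment download.

    Drops non-ASCII characters and the assumed risky ones;
    falls back to "Gmail" when the conversation has no subject.
    """
    if not subject:
        return "Gmail"
    kept = [ch for ch in subject if ch.isascii() and ch not in risky]
    return "".join(kept)


for subject in ["różne", "Re: plan/draft", ""]:
    print(repr(subject), "->", repr(approximate_archive_name(subject)))
```

Note that two different subjects can collapse into the same name this way, so it pays to rename archives right after downloading them.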

Actually, I’m ignoring one piece of information: Kolab allows you to save attachments directly to the cloud from the email application. But again: do you really need all of those files accessible globally? And even if you do, the folder structure and your mental models may have changed, so save all files somewhere on the computer and upload them when their time comes again.

Before and after

And now, a real-life example of the things that created garbage:

  • old birthday invitations with attachments;
  • scanned documents, ~1MiB per page;
  • photos, some of them of RAW-like quality;
  • some music samples.

In total:

  • 970 e-mails went into the bin (some with attachments);
    • Gmail quota before: 1,46 GB;
    • Gmail quota after: 0,28 GB.

From about 1,5GB I went down to about 300MB! Not bad at all!

/tangodelta