Technology
Contact Info in Emails
[Figure 1]
Many emails are received with a “signature” at the end,
usually consisting of the sender’s name and contact details (see Figure 1).
These details are potentially useful to the recipient, but copying
them manually into an address book is tedious and error-prone,
so they are often neglected. Finding them automatically
is the obvious way out of this situation.
Easy, isn’t it?
At first glance this might seem like a very easy task. After all, looking at Figure 1, one might think it is enough to build something that will:
- Go to the end of the email message.
- Find something that looks like a name (how about: two words beginning with capital
letters).
- Find something that looks like a title (how about: a few words beginning with capital
letters, right after the name).
- Find something that looks like a company name (how about: a few words beginning
with capital letters, right after the title).
- Find something that looks like an address (how about: a few lines with a five-digit
number for zip code).
- Find one or more phone numbers.
- Look for keywords (such as “fax”) to interpret these phone numbers.
- Find something that looks like an email address (“aaaa@bbbb.com”).
- Find something that looks like the company Web site (“www.cccc.com”).
- Put these in the appropriate fields of a Contact item in the address book
and we’re done!
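The checklist above can be sketched in a few lines of Python. This is a deliberately naive illustration, not the product's method: the patterns, the `naive_extract` name, the `tail_lines` parameter and the sample message are all assumptions made for the example.

```python
import re

# Deliberately naive patterns, one per item in the checklist above.
NAME_RE = re.compile(r"^[A-Z][a-z]+ [A-Z][a-z]+$")      # two capitalized words
PHONE_RE = re.compile(r"\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
WEB_RE = re.compile(r"www\.[\w-]+(?:\.[\w-]+)+")
ZIP_RE = re.compile(r"\b\d{5}\b")                        # five-digit zip code


def naive_extract(message, tail_lines=10):
    """Scan the last few non-empty lines of a message with fixed patterns."""
    lines = [ln.strip() for ln in message.splitlines() if ln.strip()]
    tail = lines[-tail_lines:]
    contact = dict.fromkeys(
        ["name", "title", "company", "zip", "phone", "fax", "email", "web"]
    )
    for i, line in enumerate(tail):
        if contact["name"] is None and NAME_RE.match(line):
            contact["name"] = line
            # Naive assumption: title and company are the next two lines.
            following = tail[i + 1:i + 3]
            contact["title"] = following[0] if len(following) > 0 else None
            contact["company"] = following[1] if len(following) > 1 else None
            continue
        phone = PHONE_RE.search(line)
        if phone:
            # A keyword such as "fax" decides how to interpret the number.
            key = "fax" if "fax" in line.lower() else "phone"
            if contact[key] is None:
                contact[key] = phone.group()
            continue
        email = EMAIL_RE.search(line)
        if email:
            contact["email"] = contact["email"] or email.group()
            continue
        web = WEB_RE.search(line)
        if web:
            contact["web"] = contact["web"] or web.group()
            continue
        zip_code = ZIP_RE.search(line)
        if zip_code:
            contact["zip"] = contact["zip"] or zip_code.group()
    return contact


SAMPLE = """Hi Bob,

See you Monday.

Jane Doe
Sales Manager
Acme Widgets
12 Main Street
Springfield, IL 62704
Tel: (212) 555-0142
Fax: (212) 555-0143
jane.doe@example.com
www.example.com
"""

contact = naive_extract(SAMPLE)
print(contact)
```

On a tidy signature like the sample, every field comes out right, which is exactly why the approach looks tempting.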
[Figure 2]
[Figure 3]
Aren’t we?
Well…
Emails are typically human-generated, free-format texts. Although the signature is often
added automatically by the sender’s email application, the signature itself is also
human-generated and thus its format is not fixed.
Our nice and simple list of ideas will break in the following cases, and in many others.
- Go to the end of the email message.
And what if the message carries with it a long “thread” of previous messages (Figure 2)?
- Find something that looks like a name (how about: two words beginning with capital
letters).
And what if the name is “Dr. John W. Jackson-Smith”?
- Find something that looks like a title (how about: a few words beginning with capital
letters, right after the name).
And what if there’s no title given (Figure 3)?
- Find something that looks like a company name (how about: a few words beginning
with capital letters, right after the title).
And what if the company name comes at the end (Figure 3)?
- Find something that looks like an address (how about: a few lines with a five-digit
number for zip code).
And what if there is no zip code? Or a nine-digit zip code? Or come to think of it, no address at all (Figure 3)?
- And so on…
Finally, what if there’s no signature block at all? Or what if there’s a legal
disclaimer block mixed in with the signature (Figure 3 – note the red markings)?
Will the system just grab any phone number from there and – wrongly! – place it in the Contact item?
Clearly it is not a trivial task.
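A few of these failure cases are easy to reproduce. The short sketch below uses two hypothetical naive patterns (two capitalized words for a name, five digits for a zip code) and shows how they behave on the awkward inputs mentioned above; the pattern and function names are assumptions made for the illustration.

```python
import re

NAIVE_NAME = re.compile(r"^[A-Z][a-z]+ [A-Z][a-z]+$")  # two capitalized words
NAIVE_ZIP = re.compile(r"\b\d{5}\b")                    # five-digit number


def looks_like_name(line):
    """Return True if the line satisfies the naive name rule."""
    return NAIVE_NAME.match(line) is not None


def find_zip(line):
    """Return every five-digit number the naive zip rule picks up."""
    return NAIVE_ZIP.findall(line)


print(looks_like_name("John Smith"))                 # True: the easy case
print(looks_like_name("Dr. John W. Jackson-Smith"))  # False: a real name is missed
print(looks_like_name("Sales Manager"))              # True: a title passes as a name
print(find_zip("Springfield, IL 62704-1234"))        # ['62704']: the +4 part is lost
print(find_zip("Call 555-0142 ext. 10234"))          # ['10234']: an extension, not a zip
```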
[Figure 4]
[Figure 5]
[Figure 6]
[Figure 7]
[Figure 8]
What we do
Using sophisticated mathematical algorithms, we have created a unique, patent-pending process to overcome these difficulties
and extract this information with a very high rate of success.
First of all we strip away the “thread”, if there is one. Keywords and
other “hints” (Figure 4) guide us to decide where the “true” part of
the message ends. This is essential because the thread may contain contact
info of other people – often even your own contact info!
Note that a sophisticated algorithm is required even at this stage.
A naïve rule that ends the true part as soon as a “thread”-type
keyword appears risks cutting the message too early, for instance when such
a keyword happens to occur in the body of the message itself. Only a
mathematical model that takes such cases into account can work here.
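To make the difference concrete, here is a minimal sketch that contrasts a first-keyword cut with a weighted-evidence cut. It is only a simple stand-in for illustration and not the patent-pending model described here; the hint list, weights, `window` and `threshold` values are all assumptions.

```python
import re

# Hypothetical hints with weights: strong markers count more than a lone
# header-like word, which can also occur in ordinary body text.
HINTS = [
    (re.compile(r"^-{2,}\s*Original Message\s*-{2,}$", re.I), 3),
    (re.compile(r"^On .+ wrote:$"), 3),
    (re.compile(r"^(From|Sent|To|Subject):\s"), 1),
    (re.compile(r"^>"), 1),
]


def line_score(line):
    """Sum the weights of all hints that match this line."""
    text = line.strip()
    return sum(weight for pattern, weight in HINTS if pattern.search(text))


def naive_cut(lines):
    """Index of the first line that matches any hint at all."""
    for i, line in enumerate(lines):
        if line_score(line) > 0:
            return i
    return len(lines)


def scored_cut(lines, window=4, threshold=3):
    """Index of the first hinted line backed by enough nearby evidence."""
    scores = [line_score(line) for line in lines]
    for i, score in enumerate(scores):
        if score > 0 and sum(scores[i:i + window]) >= threshold:
            return i
    return len(lines)


MESSAGE = """Hi Ann,
Here is the itinerary you asked for.
From: Boston, 9:05
Arriving Denver, 11:40
Regards,
Jane Doe
Tel: (212) 555-0142

-----Original Message-----
From: Ann Lee
Sent: Monday
Subject: Itinerary?
Can you send me the itinerary?
Ann Lee
Tel: (212) 555-0199
""".splitlines()

true_part = MESSAGE[:scored_cut(MESSAGE)]
print(naive_cut(MESSAGE))   # 2: cut at "From: Boston", losing the signature
print(scored_cut(MESSAGE))  # 8: cut at the "Original Message" marker
```

In this example the weighted cut keeps Jane's signature in the true part and leaves Ann's contact details in the discarded thread.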
Next we use a variation of the same model to classify regions within
the true part of the message. A message typically contains such regions
as a “greeting” at the top, a “body” that is the actual text of the message,
a phrase like “Best Regards” at the end, and finally a signature block (see Figure 5).
All these regions are optional – to take an extreme example, a message can even be empty.
There are also other, less common regions such as a “header” and a “footer”
(typically containing such text as “If you can’t read this message, click here” and
“To unsubscribe click here” and often appearing in mass-mailing messages).
Once again our model uses a combination of keywords and other hints.
See Figure 6 for examples of these. At the time of the first release of our product,
the model had about 500 different words and hints, and the list continues to grow.
At the end of this stage we know where the signature block is, if there is one.
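As a rough illustration of region classification, the sketch below labels each non-empty line using a handful of keyword hints plus its position in the message. The real model is described as using about 500 words and hints; the three tiny patterns, the label names and the `classify_regions` function here are assumptions chosen to keep the example short.

```python
import re

# A tiny, hypothetical hint set.
GREETING = re.compile(r"^(hi|hello|dear)\b", re.I)
CLOSING = re.compile(
    r"^(best regards|kind regards|regards|sincerely|thanks|cheers)[,.!]?$", re.I
)
FOOTER = re.compile(r"unsubscribe|click here", re.I)


def classify_regions(message):
    """Return a list of (label, line) pairs for the non-empty lines."""
    labelled = []
    first_line_done = False   # a greeting is only accepted on the first line
    after_closing = False     # lines after the closing phrase form the signature
    for raw in message.splitlines():
        line = raw.strip()
        if not line:
            continue
        if FOOTER.search(line):
            label = "footer"
        elif after_closing:
            label = "signature"
        elif CLOSING.match(line):
            label = "closing"
            after_closing = True
        elif not first_line_done and GREETING.match(line):
            label = "greeting"
        else:
            label = "body"
        first_line_done = True
        labelled.append((label, line))
    return labelled


SAMPLE = """Hi Bob,

The report is attached. Thanks for waiting.

Best Regards,
Jane Doe
Acme Widgets
Tel: (212) 555-0142

To unsubscribe click here
"""

regions = classify_regions(SAMPLE)
for label, line in regions:
    print(f"{label:10} {line}")
```

Every region is optional in this sketch too: an empty message simply yields an empty list.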
Interpreting the signature block
Another variation of the algorithm is now used to break the
signature block into its component parts.
Once again there are a number of possible parts, such as “name”, “title”,
“company”, “address” and “phone” and once again they are optional.
Some of these (like the phone number) can appear more than once, while
others (like the address) appear only once but can take up more than
one line. And once again a long list of keywords and hints is needed, as
well as a complex algorithm to interpret these correctly.
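The idea of optional, repeatable and multi-line parts can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm described here: the keyword sets, the patterns, and the rule that the first unlabelled line is the name are all hypothetical.

```python
import re

PHONE = re.compile(r"\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}")
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
WEB = re.compile(r"\bwww\.[\w-]+(?:\.[\w-]+)+", re.I)
ZIP = re.compile(r"\b\d{5}(?:-\d{4})?\b")  # five- or nine-digit zip code

# Small hypothetical keyword lists; a real list would be far longer.
TITLE_WORDS = {"manager", "director", "engineer", "president", "consultant"}
COMPANY_WORDS = {"inc", "inc.", "ltd", "ltd.", "llc", "corp", "corp.", "gmbh"}
STREET_WORDS = {"street", "st.", "avenue", "ave.", "road", "rd.", "suite"}


def label_line(line):
    """Guess which part of the signature a single line belongs to."""
    words = {w.lower().strip(",") for w in line.split()}
    if EMAIL.search(line):
        return "email"
    if WEB.search(line):
        return "web"
    if PHONE.search(line):
        return "phone"
    if words & STREET_WORDS or ZIP.search(line):
        return "address"
    if words & COMPANY_WORDS:
        return "company"
    if words & TITLE_WORDS:
        return "title"
    return None


def parse_signature(block):
    """Collect the parts; phones may repeat, the address may span lines."""
    parts = {"name": None, "title": None, "company": None,
             "address": [], "phone": [], "email": None, "web": None}
    for raw in block.splitlines():
        line = raw.strip()
        if not line:
            continue
        label = label_line(line)
        if label is None:
            # Assumption: the first line with no other hint is the name.
            if parts["name"] is None:
                parts["name"] = line
        elif label in ("address", "phone"):
            parts[label].append(line)
        elif parts[label] is None:
            parts[label] = line
    return parts


SIGNATURE = """Dr. John W. Jackson-Smith
Director of Engineering
Acme Widgets Inc.
12 Main Street, Suite 400
Springfield, IL 62704-1234
Tel: (212) 555-0142
Fax: (212) 555-0143
john@example.com
"""

parts = parse_signature(SIGNATURE)
print(parts)
```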
At last we know what is what in the signature block (Figure 7). Figure 7 is a
screenshot from our internal “lab” software, which was used to analyze thousands of email messages to help us build the model.
A fourth and final pass maps these parts onto the fields
of a Contact item in Microsoft Outlook. It keeps only
the phone number (plus an optional extension number) and removes anything else from
that line, separates “office phone” from “home phone”, and does some general cleanup.
The result – Figure 8 – is what you would have had to cut-and-paste manually if it weren’t for
Nobex Contacts for Microsoft Outlook.
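The phone-line cleanup in this final pass might look something like the sketch below. It is an illustration only: the pattern, the keyword list and the field names (`business_phone`, `home_phone` and so on) are hypothetical and are not the actual Outlook property names or the product's rules.

```python
import re

# Number plus an optional extension ("ext. 17" or "x17").
PHONE = re.compile(
    r"(?P<number>\+?\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4})"
    r"(?:\s*(?:ext\.?|x)\s*(?P<ext>\d{1,6}))?",
    re.I,
)

# (keyword hint, hypothetical contact field) in priority order.
KEYWORDS = [
    ("fax", "business_fax"),
    ("home", "home_phone"),
    ("mobile", "mobile_phone"),
    ("cell", "mobile_phone"),
]


def clean_phone_line(line):
    """Return (field, cleaned number) for a phone line, or None."""
    match = PHONE.search(line)
    if match is None:
        return None
    lowered = line.lower()
    field = "business_phone"  # default when no keyword says otherwise
    for hint, name in KEYWORDS:
        if hint in lowered:
            field = name
            break
    value = match.group("number")
    if match.group("ext"):
        value += " x" + match.group("ext")
    return field, value


print(clean_phone_line("Office: (212) 555-0142 ext. 17 (9am-5pm)"))
print(clean_phone_line("Home 212.555.0143"))
print(clean_phone_line("Fax: 212-555-0144"))
```

Everything on the line other than the number and its extension, such as the label and the opening hours, is dropped from the stored value.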
Overall, this patent-pending algorithm gets through the whole process successfully for the
vast majority of emails with signatures, saving a lot of error-prone manual
entry of contact information. It does so despite the fact that, as we have seen,
emails are extremely variable in their format, and even the address block itself is
far from standardized. While no algorithm is ever perfect, Nobex is continually
improving the results, using our ever-growing database of email examples from
ourselves and from our customers.