At first glance you may think: "What kind of question is that? It is perfectly clear what an email is!" From a typical user's perspective, the answer is indeed simple and easy to understand. From a technical (and compliance) perspective, however, it can become a difficult question, especially when it comes to archiving emails and determining what counts as the original of an email. The tax authorities review a taxpayer's email archiving on the basis of, among other things, the GoBD criteria for proper record keeping, according to which it must be possible to restore the original of every archived email.

Of course, Benno MailArchiv has been able to display every archived email in its original form from the very beginning. The email, archived in native RFC 822 format, is retrieved from the archive, unpacked and displayed as "source code". This is nothing special, nor does it answer our initial question. After all, what is archived is whatever comes from the mail server or the user mailbox.

Originals and Email Copies

But what happens if, instead of a single relevant original email, there are two, three, or even more copies? We are explicitly not talking about simple duplicates here, such as emails retrieved multiple times from a mailbox. Recognizing email duplicates is one of the most basic tasks of a mail archiving solution. Benno MailArchiv has mastered this for just as long as displaying emails in their original form.

“Multiple relevant originals of the same email? Impossible!” you say? Not in complex environments, such as those that occur again and again in practice among large hosting companies and cloud service providers (CSPs). Here, emails arrive multiple times in the “processing funnel” of Benno MailArchiv due to infrastructure-related circumstances.

The facts described below are relatively easy to understand from a technical point of view. But compliant email archiving is far more than the technical implementation of (GoBD and compliance) requirements. It is an IT solution that has to balance user concerns, legal requirements and, possibly, general compliance requirements.

We invite you to discuss with us what an email is (or perhaps is not). Perhaps your view of something as seemingly trivial as email will change after reading this article. Write us an email or use the comment function at the end of this post to tell us your opinion.

Archiving emails is generally a relatively simple task from a technical perspective. There are (roughly outlined) user mailboxes, mail servers, and transport paths. The whole thing is equipped with corresponding interfaces. Depending on the local conditions, there are often several ways to archive emails (in a manner appropriate to the relevant requirements).

Mail archiving on premises or in-house

An on-premises mail archiving system is typically characterized by the fact that all emails to be archived reach the archive, without exception, via the same path and the same mechanism. How emails are delivered to the archive may differ from customer to customer and from one IT environment to the next, but once the path has been defined and set up, it is de facto always the same. Once the connection to the mail archive is in place, everything else runs more or less automatically. One size fits all.

The advantage of this situation is uniformity. Mail delivery to the archive follows a fixed, defined scheme, usually without exceptions. Email duplicates are thus largely ruled out. Any actual duplicates that remain are sorted out by the duplicate detection of Benno MailArchiv at the moment of archiving.

Email archiving in the cloud

A completely different picture often emerges in large and complex environments, especially in the infrastructures of larger hosting and cloud providers. A variety of circumstances can play a role here: different mail transport routes, different mail transfer agents (MTAs), and different strategies for feeding the archive (actively via SMTP, or passively via IMAP or POP3, for example). As a result, it happens again and again that the same email is transported to the archive multiple times via different routes and is therefore queued for archiving multiple times.

In particular, different mail transport routes and MTAs cause the transported emails to be provided with route-specific transport headers depending on the transport route taken. If a specific email is sent to the archive multiple times via different routes and is provided with different headers each time (which is unavoidable with different MTAs), different emails are delivered to the archive from the perspective of duplicate detection.

Let's take a closer look at this using an example:

In a complex infrastructure, a specific email 'M' is delivered to Benno MailArchiv for archiving via three different paths. As each of the three copies of this email (M1, M2, M3) is transported via different paths, different headers are entered for each email copy. The email copies are identical in content, i.e., from the user's perspective, they are the same. Nevertheless, they differ formally and technically from each other due to the different headers.

The headers make the difference

From the perspective of Benno MailArchiv's duplicate detection, the three email copies M1, M2, and M3 are undoubtedly different emails. During archiving, a SHA-256 checksum is generated for each delivered email. Because the headers differ (while the copies are otherwise identical), three different checksums "C1", "C2", and "C3" inevitably result. Benno MailArchiv therefore considers the three copies of the same email to be different emails and would archive each of them individually. From the user's perspective, which looks only at the message content, it would then appear as if the archive contained three identical emails. A search of the message content would find and display all three copies.
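The effect can be reproduced in a few lines. The following Python sketch is an illustration only, not Benno MailArchiv's actual implementation: the sample message, the host names and the helpers `make_copy` and `full_checksum` are hypothetical. It builds three copies of the same message that differ only in their Received header and computes a SHA-256 checksum over each complete copy.

```python
import hashlib

# Hypothetical sample message: the part shared by all three copies.
COMMON = (
    b"From: alice@example.com\r\n"
    b"To: bob@example.com\r\n"
    b"Subject: Invoice 4711\r\n"
    b"Message-Id: <abc123@example.com>\r\n"
    b"Date: Mon, 01 Jan 2024 10:00:00 +0100\r\n"
    b"\r\n"
    b"Please find the invoice attached.\r\n"
)


def make_copy(received_by: str) -> bytes:
    """Prepend a route-specific Received header, as an MTA on that route would."""
    header = (
        f"Received: from mx.example.com by {received_by}; "
        "Mon, 01 Jan 2024 10:00:01 +0100\r\n"
    )
    return header.encode("ascii") + COMMON


def full_checksum(raw: bytes) -> str:
    """SHA-256 over the entire message, transport headers included."""
    return hashlib.sha256(raw).hexdigest()


# Three copies (M1, M2, M3) delivered via three assumed routes.
m1, m2, m3 = (
    make_copy(host)
    for host in ("mta1.example.net", "mta2.example.net", "mta3.example.net")
)
c1, c2, c3 = (full_checksum(m) for m in (m1, m2, m3))

# Number of distinct checksums among the three copies.
distinct_checksums = len({c1, c2, c3})
print(distinct_checksums)
```

Although the message content is byte-for-byte the same, a single differing header line is enough to produce three unrelated checksums.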

In such an environment, conventional duplicate detection does not make sense. Who wants to find several (according to the message content) identical emails when searching? And how can it be achieved that only one of the three email copies is archived?

Complexity requires simple solutions

As is well known, complexity is best countered with inner simplicity and simple solutions rather than with complex constructs.

If we take a closer look at the matter, we find that an email is already uniquely identifiable by the following headers together with its message text:

  • Envelope-From – X-REAL-MAILFROM
  • Envelope-To – X-REAL-RCPTTO
  • Return-Path
  • Subject
  • Message-Id
  • Date
  • From
  • To
  • Cc
  • Body

Of course, especially when an email is transported via SMTP, additional transport-specific headers (so-called Received headers) are added to it. The contents of these Received headers depend on the actual transport path of the respective email. This means that if two emails, M1 and M2, are identical in terms of the aforementioned (non-transport-specific) headers and their message text, they are definitely the same email, regardless of which and how many transport-related headers the email also contains.

Conclusion: The transport path may add headers. As a result, a single email turns into several non-identical copies. The transport-related headers are not significant for the uniqueness of the message content; their presence merely documents the transport path taken.

Besides the Received headers, DKIM signatures (and other elements) are likewise not directly related to the content of the email. These headers can be attributed to the envelope of an email.

Thus, the solution for multiple email copies that are technically different but identical in content is close at hand: while the checksum over the entire email remains mandatory for compliance reasons, a second checksum based exclusively on the email components listed above resolves the dilemma simply and effectively.
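A minimal Python sketch of the two-checksum idea follows. It is an assumption about how such a scheme could look, not the way Benno MailArchiv implements it: the function names, the header normalization and the sample messages are hypothetical, and the body is simply taken as everything after the first blank line. X-REAL-MAILFROM and X-REAL-RCPTTO are treated here as ordinary header names that carry the envelope sender and recipient.

```python
import hashlib
import re
from email import message_from_bytes

# Content-relevant headers, taken from the list above.
CONTENT_HEADERS = (
    "X-REAL-MAILFROM", "X-REAL-RCPTTO", "Return-Path", "Subject",
    "Message-Id", "Date", "From", "To", "Cc",
)


def full_checksum(raw: bytes) -> str:
    """Checksum over the entire message (needed for integrity checks)."""
    return hashlib.sha256(raw).hexdigest()


def content_checksum(raw: bytes) -> str:
    """Second checksum over the selected headers and the body only."""
    msg = message_from_bytes(raw)
    # Assumption: the body is everything after the first blank line.
    parts = re.split(rb"\r?\n\r?\n", raw, maxsplit=1)
    body = parts[1] if len(parts) == 2 else b""
    digest = hashlib.sha256()
    for name in CONTENT_HEADERS:
        # A missing header contributes an empty value.
        value = str(msg.get(name, "")).strip()
        line = f"{name.lower()}:{value}\n"
        digest.update(line.encode("utf-8", "surrogateescape"))
    digest.update(b"\n")
    digest.update(body)
    return digest.hexdigest()


# Hypothetical copies that differ only in their Received header.
common = (
    b"From: alice@example.com\r\n"
    b"To: bob@example.com\r\n"
    b"Subject: Invoice 4711\r\n"
    b"Message-Id: <abc123@example.com>\r\n"
    b"Date: Mon, 01 Jan 2024 10:00:00 +0100\r\n"
    b"\r\n"
    b"Please find the invoice attached.\r\n"
)
m1 = b"Received: from mx by mta1.example.net\r\n" + common
m2 = b"Received: from mx by mta2.example.net\r\n" + common
# A genuinely different message: same headers, different body.
other = m1.replace(b"invoice attached", b"reminder attached")
```

With this sketch, `m1` and `m2` have different full checksums but the same content checksum, while `other` differs in both. A real implementation would also have to decide how to treat folded headers, repeated headers and differing line endings.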

Practice versus formal criteria and compliance

Some topics can be elegantly solved technically. However, some practical solutions fail in everyday life due to mundane formal aspects. So it is also appropriate here to take a closer look, because compliance unfortunately takes precedence over practicability with regard to email archiving:

GoBD-compliant mail archiving requires the ability to restore every email from the archive in its original state. Every email must be displayed including all headers, attachments, etc., in effect as the "message source text". In addition, it must be possible to check every email for manipulation. Specifically, the consistency or integrity of an archived email, as well as of the entire archive content, can be verified using the checksum over the entire email mentioned above.
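The integrity check can be pictured as follows. This is a simplified, hypothetical Python sketch (the function names and the sample message are assumptions, and it says nothing about how Benno MailArchiv stores its checksums): the checksum recorded at archiving time is compared with a checksum recomputed over the retrieved message.

```python
import hashlib


def full_checksum(raw: bytes) -> str:
    """SHA-256 over the complete message, including all headers."""
    return hashlib.sha256(raw).hexdigest()


def is_unchanged(raw: bytes, stored_checksum: str) -> bool:
    """Recompute the checksum and compare it with the stored one."""
    return full_checksum(raw) == stored_checksum


# Hypothetical archived message and the checksum recorded for it.
original = b"Subject: Invoice 4711\r\n\r\nAmount due: 100 EUR\r\n"
stored = full_checksum(original)

# A manipulated version of the same message.
tampered = original.replace(b"100", b"900")

print(is_unchanged(original, stored))
print(is_unchanged(tampered, stored))
```

Because the check covers every byte, it only works with the checksum over the entire email; a checksum restricted to selected headers and the body would not notice changes to the remaining headers.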

Let's return to our example: If several textually or content-wise identical emails M1, M2, and M3 (in the sense of different copies of the same email) are fed into the archiving process, the question arises as to how to proceed with the different email copies with regard to their headers.

A view on the legal aspects of this situation

We assume that, purely legally speaking, there is no compulsion to archive multiple versions of an email, especially if they only differ with regard to the mail headers they contain. However, since it cannot be ruled out that different circles have different legal opinions, it is possible that our assumption will not meet with unanimous approval or is even legally wrong.

To rule out a possible (legal) dilemma, all copies of the email in question (M1, M2, and M3 in our example) should therefore, from a pragmatic point of view, be archived.

Viewed purely formally, the mail copies in our example are different emails. Even if the differences between them are technical in nature and have little or no practical benefit in everyday use, formally they are different emails. This can be verified immediately by their different checksums.

On the other hand, any legal uncertainty can easily be excluded by archiving all mail copies, even though it would be technically feasible, using the procedure described above with two different checksums per email, to archive only the first copy of a series of mail copies.

To achieve a legally secure solution for the operator, we recommend discussing this matter with a legal representative of your choice before implementation. Only after consulting and deciding on the specific form of implementing duplicate detection should the implementation be carried out accordingly.

For now, we assume that it could be legally sufficient to apply simplified duplicate detection, i.e., to archive only one of multiple identical email copies. Due to the GoBD requirements, creating a procedural documentation is mandatory and beyond question. We assume in this context that explaining or documenting the fact that only one email copy is archived should be sufficient to achieve a legally secure archiving.
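The difference between archiving all copies and simplified duplicate detection can be sketched as a small in-memory model. Everything in this Python example is hypothetical (the class `ArchiveSketch`, the flag `keep_all_copies` and the placeholder checksums); "C1" to "C3" stand for the checksums over the entire email, and "K" stands for the shared second checksum over the content-relevant components.

```python
from dataclasses import dataclass, field


@dataclass
class ArchiveSketch:
    """Toy archive; keep_all_copies selects the duplicate policy."""
    keep_all_copies: bool
    stored: dict = field(default_factory=dict)      # full checksum -> raw mail
    seen_content: set = field(default_factory=set)  # content checksums seen

    def offer(self, raw: bytes, full: str, content: str) -> bool:
        """Return True if the copy was archived, False if it was skipped."""
        if full in self.stored:
            return False  # byte-identical duplicate, skipped in both modes
        if not self.keep_all_copies and content in self.seen_content:
            return False  # same content already archived via another route
        self.stored[full] = raw
        self.seen_content.add(content)
        return True


# The three copies from the example, with placeholder checksums.
copies = [(b"M1", "C1", "K"), (b"M2", "C2", "K"), (b"M3", "C3", "K")]

strict = ArchiveSketch(keep_all_copies=True)
simplified = ArchiveSketch(keep_all_copies=False)

strict_results = [strict.offer(*copy) for copy in copies]
simplified_results = [simplified.offer(*copy) for copy in copies]

print(strict_results)
print(simplified_results)
```

In the strict variant all three copies end up in the archive; in the simplified variant only the first copy is stored and the later ones are skipped, which is exactly the behavior that would have to be described in the procedural documentation.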

The decision on the type of duplicate detection used, and the associated responsibility towards the tax authorities, lies solely with the operator.

What is an email?

Is every copy an email in its own right, even if two copies differ only by a single transport header that adds nothing to the information value of the message? Is an email thus to be classified purely according to formal criteria? Or is an email a message between two or more users whose essential part (and the part relevant for GoBD-compliant archiving) is the actual message? And are the mail headers, which usually remain hidden from the user anyway, perhaps less important in terms of their relevance to the GoBD, even if they are archived as part of the original mail?

For the time being, what exactly an email is remains an open question in this sense.

Now it's your turn, dear reader! Write us an email or a comment on what you think about the nature and scope of an email.

Legal Notice / Disclaimer

This article does not constitute legal advice. It is for general information only. We assume no responsibility for the accuracy or completeness of the information. Any liability is excluded.