Best way of merging mbox files
When concatenating mbox files as described at https://xaizek.github.io/2013-03-30/merge-mbox-mailboxes/ you end up with an 'unsorted' mbox file. Is this going to be a problem, especially when the files are large (>2 GB) and new emails will be written to them? The email client nicely sorts the message "foldera 5 last" from folder A as last, but of course the mbox file itself is not stored in that order.
Is there a better solution for merging files?
Having:

A folder A with messages:

  From  Subject         Received  Size
  test  foldera 1       16:18     665 B
  test  foldera 2       16:18     665 B
  test  foldera 3       16:18     665 B
  test  foldera 4       16:18     665 B
  test  foldera 5 last  16:29     670 B

A folder B with messages:

  From  Subject         Received  Size
  test  folderb 1       16:23     665 B
  test  folderb 2       16:24     665 B
  test  folderb 3       16:24     665 B
  test  folderb 4       16:24     665 B
  test  folderb 5       16:24     665 B

[@ mail] cat .foldera .folderb > .folderc

Getting a folder C with messages:

  From                       Subject                                            Received  Size
  test                       foldera 1                                          16:18     665 B
  test                       foldera 2                                          16:18     665 B
  test                       foldera 3                                          16:18     665 B
  test                       foldera 4                                          16:18     665 B
  Mail System Internal Data  DON'T DELETE THIS MESSAGE -- FOLDER INTERNAL DATA  16:19     454 B
  test                       folderb 1                                          16:23     665 B
  test                       folderb 2                                          16:24     665 B
  test                       folderb 3                                          16:24     665 B
  test                       folderb 4                                          16:24     665 B
  test                       folderb 5                                          16:24     665 B
  test                       foldera 5 last                                     16:29     670 B
On Thu, 29 Nov 2018, Marc Roos wrote:
> When concatenating mbox files as described at https://xaizek.github.io/2013-03-30/merge-mbox-mailboxes/ you end up with an 'unsorted' mbox file. Is this going to be a problem, especially when the files are large (>2 GB) and new emails will be written to them?
I don't think it will be a problem, but you might have to remove some headers (like the UUID header?). However, I think dovecot ought to be able to cope with it anyways and regenerate the indices.
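On the headers mentioned above: as far as I know, Dovecot and UW-IMAP-style servers keep IMAP state in headers such as X-UID, X-IMAPbase, X-IMAP and X-Keywords, and the "DON'T DELETE THIS MESSAGE -- FOLDER INTERNAL DATA" entry that shows up in folder C is a placeholder message carrying that state. A minimal Python sketch that copies a mailbox while dropping both could look like the following. The header list and the names `copy_clean` and `is_pseudo_message` are assumptions made for illustration, not anything Dovecot provides.

```python
import mailbox

# IMAP state headers to strip. This list is an assumption; adjust it to
# whatever your mbox files actually contain.
IMAP_STATE_HEADERS = ("X-UID", "X-IMAPbase", "X-IMAP", "X-Keywords")


def is_pseudo_message(msg):
    """True for the 'FOLDER INTERNAL DATA' placeholder message."""
    return "FOLDER INTERNAL DATA" in str(msg.get("Subject", ""))


def copy_clean(src_path, dst_path):
    """Append the messages of src_path to dst_path, minus IMAP state.

    Returns the number of messages copied.
    """
    src = mailbox.mbox(src_path, create=False)
    dst = mailbox.mbox(dst_path)
    copied = 0
    try:
        for msg in src:
            if is_pseudo_message(msg):
                continue
            for name in IMAP_STATE_HEADERS:
                del msg[name]  # does nothing if the header is absent
            dst.add(msg)
            copied += 1
        dst.flush()
    finally:
        src.close()
        dst.close()
    return copied
```

Calling `copy_clean(".foldera", ".folderc")` followed by `copy_clean(".folderb", ".folderc")` gives the same block-sorted result as cat, without the placeholder message in the middle. The server should then assign fresh UIDs when it next indexes the file.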
> The email client nicely sorts the message "foldera 5 last" from folder A as last, but of course the mbox file itself is not stored in that order. Is there a better solution for merging files?
As noted, the time order gets scrambled -- using your mail reader to get it back in time order requires sorting, an intensive operation.
It just so happens that I did this recently with a (GNU) awk script that merges multiple mailboxes into one mailbox, preserving time order. It assumes that each message starts with a From envelope header and that the timestamps within each mailbox are already sorted, e.g.
From mickey@disney.com Thu Nov 25 18:45:37 2018
From mickey@disney.com Thu Nov 25 18:45:37 2018 -0400
You're welcome to use it. There's probably a more elegant way with doveadm/dsync. Using a mail reader to sort the merged mailbox, then drag/drop/copy everything into a final mailbox could also work.
Joseph Tam <jtam.home@gmail.com>
#!/bin/sh
#
# Merge multiple mbox's into one assuming that each message
# starts with /^From .* {year}$/ and they are sorted by time.
#
#	-- Joseph Tam <jtam.home@gmail.com>
#

[ x"$*" = x ] && {
    echo "Usage: $0 mbox-file ..."
    exit 1
}

gawk -v boxes="$*" </dev/null '
function Tstamp(header) {
    # Format: Jan 22 21:00:48 2018 -0700
    #         12345678901234567890123456
    l = length(header)
    # A trailing zone offset (-0700 or +0100) is skipped, not applied
    spec = (substr(header,l-4,1)~/[-+]/)? substr(header,l-25,20) : substr(header,l-19,20)
    # Rearrange into the "YYYY MM DD HH MM SS" form that mktime() expects
    spec = substr(spec,17,4) " " ym[substr(spec,1,3)] substr(spec,4,3) \
        " " substr(spec,8,2) " " substr(spec,11,2) " " substr(spec,14,2)
    return int(mktime(spec))
}

# Print the pending message of mailbox i, then read ahead to its next From line
function DumpMessage(i) {
    if (header[i]!="") {
        printf("%s\n",header[i])
    }
    while ((getline x <mbox[i])>0) {
        if (x~/^From .*[0-9][0-9][0-9][0-9]$/) {
            stamp[i] = Tstamp(x)
            header[i] = x
            printf("%s => [%d] %d\n",header[i],i,stamp[i]) >"/dev/stderr"
            return
        }
        print x
    }
    # End of file: park this mailbox beyond any real timestamp
    printf("EOF[%d]\n",i) >"/dev/stderr"
    stamp[i] = 2147483647
    header[i] = ""
}

BEGIN {
    ym["Jan"] = "01"; ym["Feb"] = "02"; ym["Mar"] = "03"; ym["Apr"] = "04"
    ym["May"] = "05"; ym["Jun"] = "06"; ym["Jul"] = "07"; ym["Aug"] = "08"
    ym["Sep"] = "09"; ym["Oct"] = "10"; ym["Nov"] = "11"; ym["Dec"] = "12"
    n = split(boxes,mbox," ")

    # Read first header line from all boxes
    for (i=1; i<=n; i++) {
        DumpMessage(i)
    }

    # Loop until all mailboxes read
    while (1) {
        t = 2147483646
        # Find next message
        for (i=1; i<=n; i++) {
            if (stamp[i]<=t) {t=stamp[i]; j=i;}
        }
        # If no more message, quit
        if (t==2147483646) exit
        # Dump next message from mbox[j]
        DumpMessage(j)
    }
}'
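The same k-way merge can also be sketched in Python with the standard `mailbox` module and `heapq.merge`. This is an alternative sketch, not a port of the script above; the names `merge_mboxes` and `envelope_key` are made up for illustration. Like the awk script, it assumes every input mailbox is already in time order, and it ignores any zone offset on the From line.

```python
import heapq
import mailbox
import re
from datetime import datetime

# Trailing "Thu Nov 25 18:45:37 2018", optionally followed by " -0400".
STAMP = re.compile(
    r"(\w{3} \w{3} [ \d]\d \d\d:\d\d:\d\d \d{4})(?: [-+]\d{4})?$"
)


def envelope_key(msg):
    """Sort key taken from the From envelope line of an mbox message."""
    found = STAMP.search(msg.get_from())
    if found is None:
        # Assumption: messages without a parseable stamp sort first.
        return datetime.min
    return datetime.strptime(found.group(1), "%a %b %d %H:%M:%S %Y")


def merge_mboxes(src_paths, dst_path):
    """Merge time-sorted mboxes into dst_path; returns the message count."""
    sources = [mailbox.mbox(path, create=False) for path in src_paths]
    dst = mailbox.mbox(dst_path)
    count = 0
    try:
        # heapq.merge pulls one message at a time from each source and
        # always yields the one with the smallest key next.
        for msg in heapq.merge(*sources, key=envelope_key):
            dst.add(msg)
            count += 1
        dst.flush()
    finally:
        for box in sources:
            box.close()
        dst.close()
    return count
```

For example, `merge_mboxes([".foldera", ".folderb"], ".folderc")`. Each source file is scanned once to locate its messages, after which only one message per source is held in memory at a time, so file size should matter less than with an in-memory sort.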
aside from cat?
On Thu, Nov 29, 2018 at 03:07:58PM -0800, Joseph Tam wrote:
> [...]
--
So many immigrant groups have swept through our town that Brooklyn, like Atlantis, reaches mythological proportions in the mind of the world - RI Safir 1998
http://www.mrbrklyn.com

DRM is THEFT - We are the STAKEHOLDERS - RI Safir 2002
http://www.nylxs.com - Leadership Development in Free Software
http://www2.mrbrklyn.com/resources - Unpublished Archive
http://www.coinhangout.com - coins!
http://www.brooklyn-living.com

Being so tracked is for FARM ANIMALS and extermination camps, but incompatible with living as a free human being. -RI Safir 2013
On Thu, 29 Nov 2018, Ruben Safir wrote:
> aside from cat?

"cat" is fine if you're OK with having all your messages in block-sorted (as opposed to globally sorted) order. I think I made that plain in my writeup.

Most mail readers list messages in the order in which they appear in the mailbox. If you want them in time order (i.e. newest first/last) and the mailbox is not stored that way, you'll incur the overhead of sorting each time you open an index view of your messages (unless your mail reader caches these things).
Joseph Tam <jtam.home@gmail.com>
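If that repeated sorting is a concern, one option is to sort the concatenated file once, by Date header, into a new mbox. The following is a rough Python sketch under a few assumptions: the name `sort_mbox` is made up, messages without a usable Date header are placed first, and every message is parsed twice (once to read its date, once to copy it), which keeps memory use low at the cost of speed.

```python
import mailbox
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Assumption: undated or unparseable messages sort before everything else.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def sent_time(msg):
    """Date header as a timezone-aware datetime."""
    raw = msg["Date"]
    if raw is None:
        return EPOCH
    try:
        when = parsedate_to_datetime(str(raw))
    except (TypeError, ValueError):
        return EPOCH
    if when.tzinfo is None:  # dates with a "-0000" zone come back naive
        when = when.replace(tzinfo=timezone.utc)
    return when


def sort_mbox(src_path, dst_path):
    """Write the messages of src_path to dst_path in Date order."""
    src = mailbox.mbox(src_path, create=False)
    dst = mailbox.mbox(dst_path)
    try:
        # Keep only (date, key) pairs in memory, not the message bodies.
        order = sorted((sent_time(src[key]), key) for key in src.keys())
        for _, key in order:
            dst.add(src[key])
        dst.flush()
    finally:
        src.close()
        dst.close()
    return len(order)
```

For example, `sort_mbox(".folderc", ".folderc.sorted")`, after which the sorted file can replace the original once it has been checked.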
participants (3): Joseph Tam, Marc Roos, Ruben Safir