From: dbucklin@sdf.org
Date: 2018-04-17
Subject: Evernote Extraction

I take notes all the time.  I love having access to my notes
wherever I go.  Evernote does that.  However, I've become
increasingly dissatisfied with the complexity of their client
software.  Also, they recently stopped supporting Geeknote, a CLI
client. [1]  Geeknote has its own problems, so maybe it's time to
make a change.

After  evaluating  a number of solutions, I settled on vimwiki. [2]
Vimwiki will let me manage my information in plaintext  and  I  can
even  publish an HTML version of it.  My entire collection of notes
should be small enough that I can pull everything down to my phone.
Now I just have to extract my data from Evernote.  Easy, right?

Evernote doesn't make a desktop client for Linux, so I fired up my
Mac Mini since I need to use the desktop client to export my  data.
I  exported  each  of my notebooks into a separate enex file (Ever-
note's XML format).  Looking at it, I wonder  if  it's  even  valid
XML.  How am I going to get my data out of here?

My first move is to install html-xml-utils.  After experimenting
with `hxpipe` and `hxextract`, it seems like html-xml-utils is more
about manipulating HTML/XML while retaining the format than about
separating the data from the format.
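
For example, `hxpipe` flattens the markup into a line-per-token
stream that awk or perl can chew on, and `hxunpipe` turns that
stream back into markup.  A quick look, with notebook.enex standing
in for any of my exports:

     hxpipe notebook.enex | less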

I had a quick chat with tomasino [3] and he referred me to
ever2simple. [4]  Ever2simple is a tool that aims to help people
migrate from Evernote to Simplenote.  After some trial and error, I
was able to install ever2simple, but I first had to install
python-pip, python-libxml2, python-lxml, and python-libxslt1.

I'm starting with one of my smallest notebooks, a journal, just  so
I  can  prove the concept.  I want to migrate these journal entries
to my journal.txt file that I maintain with jrnl. [5] I  tried  the
`-f dir` option first, hoping this would just give me a folder full
of text files.  That's exactly what it does, but there's no metada-
ta.   I  need the timestamps.  Using ever2simple with the `-f json`
option gives me my metadata, but now everything is in a  huge  JSON
stream.   After  some experimentation with sed, I conclude that sed
is not the right tool for this job.
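
For the record, the two invocations look something like this
(mirroring the command I run later against the recipes notebook;
ever2simple's --help has the exact flags):

     ever2simple -f dir -o journal journal.enex
     ever2simple -f json -o journal.json journal.enex

The JSON it emits is roughly an array of note objects carrying the
fields I care about:

     [
       {
         "createdate": "Jul 25 2011 14:30:00",
         "content": "Dear diary, ..."
       },
       ...
     ]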

I remember hearing about something called `jq` that should  let  me
work  with JSON.  The apt package description for `jq` starts with,
"jq is like sed for JSON...".  Well, I'm sold.  Also, no  dependen-
cies! What a bonus.  The man page is full of explanations and exam-
ples, but I'm going to need to experiment with the filters.  After
some fiddling, I land on

     jq '.[] | .createdate,.content' journal.json

This cycles through each top-level element and extracts the create-
date and content values.  Now I wonder how I can add a separator so
that  I  can dissect the data into discrete files with awk or some-
thing.  I should be able to add a literal to the list of filters.

     jq '.[] | .createdate,.content,"%%"' journal.json

Well, the %% lines include the quotes, but that's not  the  end  of
the  world.   I wonder what date format I need for jrnl.  Each jrnl
entry starts with

     YYYY-MM-DD HH:MM Title

Evernote gives me dates that look like

     Jul 25 2011 HH:MM:SS

`date --help` to the rescue!
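
For a single timestamp, GNU date can already do the conversion (a
quick sanity check; the input is just the Evernote format from
above):

     date -d "Jul 25 2011 14:30:00" "+%Y-%m-%d %H:%M"

That prints 2011-07-25 14:30, but I'd rather do the conversion
inside the jq pipeline.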

Looking at date handling in `jq`, I should be able to  convert  the
dates  from  the format used by Evernote to the format used by jrnl
with the filter

     strptime("%b %d %Y %H:%M:%S")|strftime("%Y-%m-%d %H:%M")

All together, then.

     jq '.[] | (.createdate|strptime("%b %d %Y %H:%M:%S")|strftime("%Y-%m-%d %H:%M")),.content,"%%"' journal.json

I still have some garbage in there, but I'm getting close to  being
able  to  just  prepend this to my journal.txt file.  OK, I'm close
enough with this:

     jq '.[] | (.createdate|strptime("%b %d %Y %H:%M:%S")|strftime("%Y-%m-%d %H:%M")),.content,"%%"' journal.json | sed -e 's/^"//;s/"$//;s/\\n/\n/g' | sed -e '/^ *$/d' >journal.part
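
In hindsight, jq's -r option would do most of that sed work for me:
raw output prints strings without the surrounding quotes and with
the \n escapes already expanded, so something like this (untested)
should leave only the blank-line cleanup:

     jq -r '.[] | (.createdate|strptime("%b %d %Y %H:%M:%S")|strftime("%Y-%m-%d %H:%M")),.content,"%%"' journal.json | sed -e '/^ *$/d' >journal.part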

Okay, let's try the recipes notebook.  My recipes  notebook  should
be  a little more challenging than my journal entries, but it's not
as massive as my main notebook.

     ever2simple -f json -o recipes.json recipes.enex

My journal json file was 5k.  This one is 105k.  Running  the  same
command  as  before gives me pretty legible output.  I know some of
these notes had attachments, but I don't see them in the  JSON.   I
wonder if they are mime-encoded in the XML file.
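
One way to check is to count the base64 markers, assuming Evernote
tags each attachment's data element with an encoding attribute:

     grep -o 'encoding="base64"' recipes.enex | wc -l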

Looking  back  at my recipes.enex file, attachments do appear to be
base64 encoded in the XML, but ever2simple doesn't copy  this  data
into  the  JSON file it creates.  This makes sense since its target
is Simplenote.  Maybe html-xml-utils can help me  get  these  files
out.

     hxextract 'resource' recipes.enex
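
Stripped down by hand, a resource element looks something like this
(the mime and resource-attributes parts here are simplified, so
take the details with a grain of salt):

     <resource>
       <data encoding="base64">
       /9j/4AAQSkZJRg...
       </data>
       <mime>image/jpeg</mime>
       <resource-attributes>
         <file-name>photo.jpg</file-name>
       </resource-attributes>
     </resource>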

It  looks like the files are encapsulated within resource elements.
The resource element contains metadata about the attachment and the
base64-encoded data itself is inside a data element.  I can isolate
the data using hxselect.

     hxselect -c -s '\n\n' data < recipes.enex > recipes.dat

This gives me all the mime attachments  in  a  single  file.   Each
base64-encoded  file  is  separated  by two newlines.  This doesn't
preserve my metadata, but I'm anxious to get the data out  and  see
what's in there.  Let's see if I can pipe the first one into
`base64 -d` to decode it.  An awk one-liner should let me terminate
output at the first blank line.

     awk '/^$/ {exit}{print $0}' recipes.dat | base64 -d > testfile

Now I can use `file` to find out what kind of file it is.

     file testfile

This tells me that it's an image.  A JPEG, to be specific, and it's
300 dpi and 147x127.  That seems small.  I wonder if Evernote
encoded  all  of  the  images  that were in the html pages I saved.
Opening the file in an image viewer, I can see that that's  exactly
what it is.  How many attachments are in there?  Could I...

     sed -e '/^./d' recipes.dat | wc

Deleting every non-empty line leaves only the blank separator
lines, and wc's line count tells me how many there are.  Damn,
that's slick.  There are 74 files in there.  I'll bet only a
handful of them have any value to me.  I think the easiest way
forward is to copy each base64 attachment into its own file.
Looking at split(1), I see it splits on line count, not on a
delimiter.  What if I do something like...

     #!/usr/bin/awk -f
     # Split recipes.dat into one file per attachment.
     # Assumes the dump/ directory already exists.
     BEGIN {fcount=1}
     /^$/ {fcount++; next}       # blank line: start a new file
     {print $0 >> "dump/" fcount ".base64"}
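
Saved as split.awk (my name for it), it runs like so:

     mkdir -p dump
     awk -f split.awk recipes.dat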

This  goes through my recipes.dat file and puts each base64-encoded
attachment into its own file.  Now I need to decode them  and  give
them an appropriate suffix.

     #!/bin/bash
     # Decode each base64 dump and rename it by detected file type.
     for f in dump/*.base64
     do
       outfile="${f%.*}.out"
       base64 -d "${f}" > "${outfile}"
       type=$(file "${outfile}")   # e.g. "dump/1.out: JPEG image data, ..."
       type="${type#* }"           # drop the "dump/1.out:" prefix
       type="${type%% *}"          # keep the first word: "JPEG"
       newout="${outfile%.out}.${type}"
       mv "$outfile" "$newout"
     done

Phew!  Now I have 74 files to look through.  Most of these are
garbage from web pages I saved.  There are really only five that I
want to keep.  There are a few problems with this approach:

   * I lose the original file name (but see the sketch below).
   * I use the file utility to reconstruct the filename extension.
   * I lose the association between the file and the note.
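
For the file names, at least, there may be a fix: the enex
resource-attributes block carries a file-name element (when
Evernote recorded one), so something like this should list the
original names in order:

     hxselect -c -s '\n' 'file-name' < recipes.enex

Pairing those up with the dump files would be another small awk
job.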

This  has  been  a  lot of work, and there's a lot more to be done.
Looking at my main notebook, I may revisit ever2simple's  `-f  dir`
option.   I  could even look at the source and see if there's a way
to tack on metadata.

I assume there are better ways to go about this, but I love
challenges like this because they're an excuse to learn new tools
and to get better with the tools I'm already familiar with.  Next
time, I'll show you how I migrate this information to vimwiki.

## References

1. http://www.geeknote.me/
2. https://vimwiki.github.io/
3. gopher://gopher.black
4. https://github.com/claytron/ever2simple
5. http://jrnl.sh