% date: 2018-03-08

POSIX Shell Scripting Survival Guide
====================================

Authors: ulcer <ulcer@sdf.org>
License: CC-BY-SA 4.0
Published: 2018-03-08
Updated: 2018-05-08
Link: gopher://sdf.org/0/users/ulcer/kb/kb/article-script-survival.md
Mirror: https://uuuuu.github.io/article-script-survival.txt

Buy me cookies: https://patreon.com/ulcer

Table of contents:

1. Introduction
    1.1. Note on shells: bourne shell syntax, bashisms and POSIX
    1.2. Tools of trade: BSD userland, GNU coreutils
    1.3. Shell scripting limitations
2. Scripting basics
    2.1. echo vs printf and few more details
    2.2. Variables, eval and quotation
    2.3. Conditions and loops: if, while
    2.4. Options parsing and validation: getopt vs getopts, case
    2.5. Life without arrays
    2.6. Change file
    2.7. Check lock
3. Commentary on tools
    3.1. Working with pipes: stdbuf and pv
    3.2. Notes on grep
    3.3. Notes on sed
    3.4. Notes on awk
    3.5. Notes on portable syntax
    3.6. Notes on UTF-8 compatibility
    3.7. Working with XML and HTML: curl, tidy, xmlstarlet and others
    3.8. Working with JSON: jq
    3.9. Working with CSV: miller
4. Advanced topics
    4.1. Reading user input
    4.2. Internal field separator
    4.3. Command line options manipulation
    4.4. Nested constructions difficulties and recursion
    4.5. Libraries and trap cascading
    4.6. Debugging
    4.7. Testing and deploying in new environment
    4.8. Networking with shell
    4.9. Paste safety
5. Further reading
6. References
7. Changelog

% body

1. Introduction
---------------

While it is true that the tasks shell solves are of limited scope, with
the POSIX shell and toolset you may still get plenty of day-to-day and
administration/deployment/maintenance work done without caring much
about the platform you use - BSD or GNU.

This guide was motivated by watching fellow SDFers making common
mistakes and assumes you know how to do "hello world". It should also
provide you with an answer to the question "how do I solve real life
problems with this junk". Given the number of historical tool
alternatives and the functionality spread between them, that is not an
obvious question. This guide is highly opinionated; where possible a
link with reasoning is provided.

### 1.1. Note on shells: bourne shell syntax, bashisms and POSIX

Since this guide assumes you know some basics, those basics were
probably learned in bash. If you didn't really dive into it (except for
arrays), you should be aware that only a minimal number of differences,
called bashisms [1], holds you back from POSIX compliant syntax (the
subset of language features you may successfully run under any major
contemporary shell without extra effort). Sticking to it also removes
the question of bash incompatibilities between different versions. The
same bashisms curse follows ksh users.

The other reason for getting rid of bashisms is the real slowness of
bash. While it's good for interactive usage, there are better choices,
with zsh holding the ultimate position at command completion. The
famous ShellShock vulnerability also suggests the necessity of a less
bloated shell for system tasks.

You may check the common bashisms list at [2] and use the
"checkbashisms" script bundled in the Debian "devscripts" package [3].

The shells closest to the POSIX standard are the Almquist shell
successors: dash and busybox ash. I usually develop scripts in dash,
which is the default Debian system shell. Dash scripts usually run
under bash without even minor changes, and bash can also be run in
"bash --posix" mode. Zsh provides bare minimum default compatibility
with POSIX syntax, requiring "zsh -c 'emulate sh; sh'" to run POSIX
scripts.

As you noticed, POSIX sh has one major drawback: it has no arrays.
Hence csh, ksh, bash and so on. One thing you should know when looking
at shell alternatives: csh (tcsh, if you like) is a dead branch [4]
[5]. If you ask which shell is the best compromise between POSIX
compliance and the bare minimum improvement - arrays - it'd probably be
mksh (MirBSD ksh), which is already the default Android shell.

Alternative syntaxes not compatible with POSIX sh: rc, fish and zsh.
Fish is a contemporary effort to fix bash problems, which doesn't focus
on execution efficiency. You may surely take a look at the rc paper and
the plan9 ecosystem to spot the problematic parts of the traditional
shell [6]. Zsh, while not compatible with POSIX syntax, offers an
interactive experience significantly improved over bash and the best
currently available completion system.

Shells are largely bloatware. An example comparison of manpage sizes
between shells (using "MANWIDTH=80 man dash | wc -l"): dash - 1590,
bash - 5742, rc - 1143. Don't be surprised when you hit a bug: dash, at
the moment I studied it extensively (the Debian 7 version), had empty
"trap" output and a buggy printf. So don't be afraid of trying your
portable work under multiple shells (e.g. dash and bash) if you're not
sure whether it's the shell misbehaving.

### 1.2. Tools of trade: BSD userland, GNU coreutils

The GNU part of the GNU/Linux name refers to the GNU ecosystem,
including coreutils and tools like "cat", "sed", "tr" and so on. The
corresponding part of the BSD world is called userland. Embedded
systems often use busybox, which bundles all these tools into a single
binary with functions that are often too restricted even for POSIX.

You might expect any of these tools to fully support UTF-8, adhere to
the POSIX standard and be bug free. Maybe in the future; for now,
prepare to switch tools on occasion.

Following "do one thing and do it well", you should be aware of
existing tools so as not to waste time attempting to solve your problem
with the wrong tool (like replacing newlines with sed).

### 1.3. Shell scripting limitations

A small subset of shell scripting deficiencies you should be aware of:

- speed. The lack of arrays and of fancy operations, with a subshell
  fired each time you need to do something, contributes to general
  slowness. Yet individual operations (be it grep or awk) provide a
  great speed benefit over scripting languages
- subshells consume variables: it's hard to make piped constructs
  return multiple values and impossible to carry exception-like
  messages, and so forth
- untyped and unstructured values: shell efficiency can be extended to
  the topic of typed/binary data flows, but that's another story

As the comment in Structural Regular Expressions goes, "the silent
limits placed on line lengths by most tools can be frustrating." [7]

In general, the deeper you dive into shell scripting, the more
limitations you'll discover. Given all the subtle details in every
small facility, support (and even development) is a burden.

2. Scripting basics
-------------------

As promised in the title, this is more of a guide, so I won't duplicate
detailed answers and will only give solutions in an order suitable for
learning.

### 2.1. echo vs printf and few more details

To make a long story short: there are differences between built-in and
standalone echo implementations, there are subtle backslash and option
parsing details, and you shouldn't use echo for anything other than a
fixed string (preferably without backslashes and other special chars).
For every other case you've got printf. [8]

It's usually done like this:

    $ a=$( printf "%s\n" "${var}" | something )

Note that command substitution strips trailing newlines, so depending
on what you do you may need to add the trailing newline back. Piping to
a while loop is quite a common pattern:

    $ a=$( seq 1 5 ) ; printf "%s" "$a" | \
        while IFS= read -r var; do
            printf "%s\n" "${var}"
        done

This gives only 4 result lines, as we wrote 'printf "%s"' instead of
'printf "%s\n"': without a terminating newline on the last line, the
final read fails.

That said, you shouldn't use backticks for command substitution, always
favouring the $() notation [9]. Also note that "read" by default reacts
to backslashes, which is turned off with the "-r" switch. For whole
lines you may also use the line(1) utility.

If you need a sole newline character in a variable, it's done like this
(the trailing space protects the newline from being stripped by the
command substitution):

    $ nl=$(printf "\n "); nl="${nl% }"

There's also no portable way to expand $'...'-style bashisms, so for
the sake of being universal you just printf the desired character into
a variable using its octal code.

You may see redirections in a certain order, which is important. Like
this:

    $ date >/dev/null 2>&1

One thing you should remember about them: they are processed in the
order they are written, so "2>&1 >/dev/null" won't silence stderr.
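
To see both orders side by side (ls here is just a convenient command
that writes to stderr):

    $ ls /nonexistent >/dev/null 2>&1   # everything discarded
    $ ls /nonexistent 2>&1 >/dev/null   # error message still printed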

### 2.2. Variables, eval and quotation

Let's go back to string printing example:

    $ printf "%s\n" "${var}"

As you can see, I wrote the variable inside curly braces inside double
quotes, that being the only universally safe way of using a variable
(quote usage and the "${}" form of expansion being separate topics).
[10]

Regarding "${}" safety, consider the next example:

    $ class="core"; var="path"; core_path="${HOME}"
    $ eval dest=\"\$$class_$var\"
    $ echo "Destination: ${dest}"

This won't work until you wrap "class" from the second line in braces
("${class}"), because otherwise the name is parsed as "class_".

See the next example with unquoted variable expansion:

    $ AWK=mawk AWK_OPTS="-W interactive"
    $ { while :; do date; sleep 1; done; } | $AWK $AWK_OPTS '{print}' -

The next example splits unquoted QUERY_STRING according to IFS into
positional params, available as $1, $2 and so on:

    $ set -- ${QUERY_STRING}

All you have to know about "eval": it is the best way of shooting
yourself in the foot, because all your newlines, semicolons and every
kind of expansion take effect. Don't use it without an extreme reason
and proper quotation.

Unix filenames may include newlines. All the fuss about proper
quotation, "xargs -0" and such is about safety from crashes and other
malicious actions (e.g. with "while read line" loops).

Make it a rule to quote variables in double quotes every place you use
them, to prevent at least IFS splitting. Double quotes around a command
substitution ("$()") are not necessary in a plain assignment, but are
needed in other contexts.

One note on variable scope. POSIX doesn't define "local" for local
variables in functions, but you may find it in literally any shell
around. Otherwise just use unique names and preserve shared variables
(like IFS) in backups.
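
A sketch of that backup approach (the function and variable names here
are made up):

    split_csv() {
        _old_ifs="${IFS}"; IFS=,
        # unquoted expansion: split $1 on commas into positional params
        set -- $1
        printf "%s\n" "$@"
        IFS="${_old_ifs}"
    }
    split_csv "a,b,c"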

### 2.3. Conditions and loops: if, while

Let's look at the syntax in this example:

    [ ! -f "${list_raw}" ] ||
        [ "${TMWW_DRYRUN}" != "yes" \
        -a $(( $(date +%s) - $( stat ${STATSEC} "${list_raw}" ) )) \
        -gt "${TMWW_DELTA}" 2>/dev/null ] && \
            fetch_all

When you write conditions, "[" is equivalent to the "test" built-in
(with the closing "]" being mere decoration that "[" requires). It's
quite a powerful operator, but its error messages too often lack the
cause of the problem. First of all, the only correct string equality
operator is "=" ("==" is a frequent typo).
The only correct syntax for shell built-in calculations (arithmetic
expansion) is "$(())": "$ i=$(( $i + 1 ))". More complex calculations
are solved using expr(1), specifically when you need string
manipulation functions without resorting to a mastodon like awk.

A thing you may often see in scripts is ":", which is equivalent to a
"NOP" machine instruction and does exactly nothing. Like this:

    $ while :; do date; sleep 1; done
    $ : ${2?aborting: missing second param}
    $ : >flush_or_create_file

A few words about the "if" statement. Tests like '[ "x$a" = "x" ]' are
archaic, related to the earliest shells and absolutely useless
nowadays. With tests written as "test -n" or "test -z" you shouldn't
ever have to wonder whether a variable is "empty" or "unset", though
something like '[ "$b" = "" ]' is fine too.
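
A trivial sketch ("input" being an arbitrary variable name):

    if [ -n "${input}" ]; then
        printf "got: %s\n" "${input}"
    else
        echo "input is empty or unset"
    fi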

"while read" piped constructs with external calls are the slowest part
contributing to overall script speed. They also cannot handle values
containing newlines. Though a pretty obscure case, that can still be
addressed with xargs [11]:

    ... | tr '\n' '\0' | xargs -0 -n1

### 2.4. Options parsing and validation: getopt vs getopts, case

A general note on option notation: there are short options like "-a",
which can be written concatenated, like "-ab 'value'", depending on how
smart your option parser is, and GNU-style long options, like
"--version" and "--help" (these two being the most ubiquitous among GNU
tools). When you need to explicitly tell the option parser to stop
accepting options, there's the handy "--" empty option:

    $ kill -- "-${pgid}"
    $ random_input | grep -- "${filter}"

Note that the "${filter}" variable in the last example may start with a
dash, so it's always good to put "--" beforehand.

The only "getopt" you should use is the shell built-in "getopts". If it
happens that you need long options for something like a shell script,
you really should reevaluate the right tool for your task. [12]

A common pattern combines option parsing with "shift" to drop the
already handled parameters, as sketched below.
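
A minimal sketch of that pattern (the "-v" and "-o" options here are
made up for illustration):

    verbose=""; outfile=""
    while getopts "vo:" opt; do
        case "${opt}" in
            v) verbose="yes" ;;
            o) outfile="${OPTARG}" ;;
            *) echo "usage: $0 [-v] [-o file] args" >&2; exit 1 ;;
        esac
    done
    shift $(( OPTIND - 1 ))
    # "$@" now holds only the non-option arguments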

NOTE: if you strive for "yes/no" and other interactive facilities in
    your script, remember that you lose all the scripting/piping
    benefits of unix filter-type programs

Now we come to the "case" instruction; there are a few hints to care
about. First, make sure you place the empty case before the "*"
wildcard (note that patterns, apart from the empty one, need no
quoting):

    case "$s" in
        foo) echo bar ;;
        '') echo empty ;;
        *)  echo else ;;
    esac

You may also do basic validation:

    case "${input}" in
        *[!0-9]*) echo "Not a number" ;;
    esac

These checks are limited to glob patterns (the same ones you use in
interactive sessions, like "rm *~"), so you should grep/sed for any
stronger validation.

### 2.5. Life without arrays

If you can't rely on something like mksh for array support, there's
still life out there.

Most probably your data is line oriented and gets appended, sorted and
uniq'ed. Let's query it:

    $ a=$( seq 1 3 ); seq 40 60 | grep -F "$a"

If you need to search field separated data (key-value or any kind of
CSV) per line, fast lookup is done with the "join" utility. Here, the
"storage" file is a CSV sorted on its first column (the column to be
queried), and the "request" file is a sorted list of one term per line:

    $ join -t ',' storage request

The simple way of walking a string of words is parameter substitution.
Let's for example split the first word off a word list:

    $ array="alpha beta gamma"
    $ head="${array%% *}"; array="${array#* }"

NOTE: if you can't remember which one is for prefix and which for
    suffix, "#" is on the left (under "3") on an IBM keyboard, which is
    prefix, thus "%" being suffix (you're reading this in English
    anyway). See how you can split the hostname from a link:

    $ link="gopher://sdf.org/0/users/ulcer/"
    $ hostname="${link#*://}"; hostname="${hostname%%/*}"
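
A loop walking the whole list with the same substitutions might look
like this (a sketch assuming single-space separators):

    list="alpha beta gamma"
    while [ -n "${list}" ]; do
        head="${list%% *}"
        # drop the consumed word; the last word has no separator left
        [ "${head}" = "${list}" ] && list="" || list="${list#* }"
        printf "item: %s\n" "${head}"
    done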

The other approach involves splitting by IFS and is prone to errors;
see the "4.2 Internal field separator" chapter. The rest rely on printf
and pipes:

    $ a=$( printf "%s\n" "$a" | grep -v "exclude_me" )
    $ result=$( printf "%s\n" "${date}"; while read line; do
            ...
        done; )

The size of a variable you operate on is limited by the maximum size of
a single argument passed to the underlying exec(3) call [13]. Usually
it's on the order of a hundred KB.

When you don't want to use awk's "getline" while-cycles, the usual
practice is feeding awk multiple files and detecting the end of the
first one with the NR==FNR check:

    $ echo www-data | awk 'NR==FNR{split($0,a,":"); b[a[1]]=$0; next} \
        {print 123, b[$1]}' /etc/passwd -

Furthermore, jumping between files listed on the command line is done
with the "nextfile" awk command, which is not POSIX but is widely
supported (e.g. by gawk and mawk).

In order to pass more line oriented data to awk(1), you may send it as
an awk variable:

    $ a=$( seq 1 10; ); echo | mawk -v a="$a" \
        'END{split(a,f,"\n"); for (i in f) print f[i]}'

But if you think you need arrays of arrays, kinds of linked lists and
so on, it's time to reevaluate whether you'd still be able to read this
script later if you solve everything with shell/awk.

### 2.6. Change file

File editing is not as trivial as you may expect. The simplest way is
e.g. in-place sed editing with the GNU sed "-i" switch. But what if you
don't have GNU tools or want to edit the file with awk?

The usual template looks like this [14]:
    
    inplace() {
        local input
        tmp=$( mktemp )
        [ $? -ne 0 ] && { echo failed creating temp file; exit 1; }
        trap "rm -f '${tmp}'" 0
        input="$1"
        shift
        "$@" <"${input}" >"${tmp}" && cat "${tmp}" >"${input}"
        rm -f "${tmp}"
    }
    inplace "target_file" sed "s/foo/bar/"

You may certainly use mv or cp, but cat here is the safest option as it
preserves permissions and hard links. This is the behavior ed(1)
provides.

Regarding attributes: cat is surely the safest way of writing changes.
One more or less complex example is sharing files for
creation/removal/write access between group members, which involves
mode 2770 on the directory, umask 002 and proper FS ACL settings. cat
is the only utility which won't break permissions on modification. In
such environments you may additionally check files with "chmod +w".

Pay attention to pending disk writes with the sync utility [15].

### 2.7. Check lock

Depending on system preferences, /var/lock or /var/run/lock can be
available for unprivileged locks.

Locking using mkdir(1) is preferred to other methods because it's an
atomic operation (no separate "check lock", then "create lock" steps)
and it's native to the shell.

    mkdir "$1" 2>/dev/null || { echo locked; exit 1; }

Prepare the parent path with a separate call, as "mkdir -p" ignores
errors on an existing target directory.
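
A fuller sketch combining the mkdir lock with cleanup on exit (the lock
path here is arbitrary):

    lockdir="/tmp/myscript.lock"
    if mkdir "${lockdir}" 2>/dev/null; then
        # remove the lock when the script exits
        trap 'rmdir "${lockdir}"' 0
    else
        echo "already running" >&2
        exit 1
    fi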

3. Commentary on tools
----------------------

When you lack arrays (and given shell execution speed) you are kind of
forced to pipe data to specialized filters. Luckily there are enough of
them nowadays.

### 3.1. Working with pipes: stdbuf and pv

You may naturally expect realtime output from piped constructs, but the
result depends on the line buffering behavior of the tools used.

stdbuf(1) is a wrapper which adjusts the stdio buffering of the target
tool, e.g. forcing it to output each line as soon as it's ready [16].
This matters most when you mangle heavy datasets. Try something like
this to get a better idea:

    $ while :; do date; sleep 1; done | mawk -W interactive '{print}' -

Fast grep won't help you if you pass its results through a tool without
line buffering. Almost every tool in your standard toolset (grep, cut,
tr, etc.) requires tweaks or an external wrapper.

Line buffering support per tool:

- grep: GNU grep has "--line-buffered" switch
- mawk: has "-W interactive" switch
- sed: has "-u" switch
- cat: OK
- gawk, nawk, tr, cut: require stdbuf wrapper
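
For the tools that need the wrapper, the invocation looks something
like this (a sketch; stdbuf is part of GNU coreutils):

    $ while :; do date; sleep 1; done | stdbuf -oL cut -d' ' -f1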

Solutions outside of Linux vary. See [17] for detailed explanations and
a comparison of solutions (e.g. the TCL expect "unbuffer" program). For
awk, "fflush()" (POSIX) may be attempted.

pv(1) helps when you want to know the quantity of data passed over a
pipe, measure network speed with netcat and so on [18]. mbuffer(1) can
be used as an alternative to pv, exclusively for buffering tasks.

### 3.2. Notes on grep

Looks like in 2018 you can't market "grep" without "data science", "big
data" and other buzzwords. Ok, here are a few hints for faster grep.
First of all, the "-F" switch stops interpreting the pattern as a
regular expression, which means faster grep. Other things to consider:
setting "LANG=C" instead of a UTF locale and utilizing your multiple
CPU cores. See the relevant publications [19] for xargs(1) (see the
"-P" switch) and parallel(1) (note the "--linebuffer" switch). Also
don't forget about the "-m" switch if you know you need only a few
matches.
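
A sketch of the parallel approach ("${pattern}" and the log file glob
are placeholders; note that output lines from concurrent greps may
interleave):

    $ find . -type f -name '*.log' -print0 | \
        LANG=C xargs -0 -P 4 -n 100 grep -F -- "${pattern}"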

Again, remember about "grep --" when your search argument may start
with a dash.

In case of fuzzy search terms, there's the approximate grep, agrep(1).
This tool searches regex patterns within a given Levenshtein distance
(which covers all kinds of letter rearrangement, like mixing, dropping
or inserting an extra letter). Try agrep on the next example to get the
idea:

    wesnoth     # agrep -0 wesnoth
    wesnoht     # agrep -1 wesnoth
    westnorth   # agrep -2 wesnoth
    western     # agrep -3 wesnoth

GNU grep is also good for searching binary data:

    $ LANG=C grep -obUaP "PK\x03\x04" chrome_extension.crx

This can also be performed with binwalk(1). And if you just want to
peek at a file for readable strings, there's the binutils strings(1)
tool.

### 3.3. Notes on sed

The most important thing you should remember about sed: it's a tool and
not a programming language. People may write brainfuck interpreters in
sed - just leave them alone.

You probably know about the legacy of basic and extended versions of
regular expressions (try "man 7 regex" on Linux) - "Some people, when
confronted with a problem, think "I know, I'll use regular
expressions." Now they have two problems" //Jamie Zawinski. None of
them, POSIX or GNU, knows about the non-greedy quantifier. The usual
workaround for a single-character look-ahead looks like this:

    $ echo "foo https://www.ietf.org/standards/ bar" | \
        sed 's|http[s]://\([^/]*\)/[^ ]*|\1|'
    foo www.ietf.org bar

For non-greedy replacement of a character sequence, the target sequence
is first replaced with a character unique for the given input, and then
the previous example is followed.
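
A sketch of that trick, cutting everything up to the first "STOP" (the
\001 character is assumed not to occur in the input):

    $ u=$( printf '\001' )
    $ echo 'one STOP two STOP three' | \
        sed "s/STOP/${u}/;s/[^${u}]*${u}/[cut]/"
    [cut] two STOP three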

If you want to avoid the inconsistency of sed's extended regexp flags,
the usual thing lacking in basic regexp is the "?" quantifier, which
can be emulated with the POSIX interval expression "\{0,1\}" (e.g.
"x\{0,1\}" or "\(xyz\)\{0,1\}").

GNU sed is unicode aware and supports lower/upper case conversion which
works like this:

    $ echo Aa Bb Cc | sed 's/\(..\) \(..\) \(..\)/\L\1 \U\2 \E\3/'
    aa BB Cc

Depending on the implementation, sed may force you to write each
command under a separate "-e" option and can be strict about closing
semicolons ";", like here:

    $ sed -n '/marker/{p;q;}' example

A safe assumption about sed is that it doesn't know about escape
sequences, so you have to provide the desired character either
literally (e.g. a tab character) or from a shell variable, like
"tabchar=$( printf '\011' )".

### 3.4. Notes on awk

Which awk? The short answer: gawk when you need UTF-8 support, mawk
when you don't. Relevant literature: mawk vs gawk and other languages
speed comparison [20].

mawk 1.3.3, which is the most widely distributed version, suffers from
a number of sore bugs, like the lack of regex quantifier "{}" support.

A few survival hints for awk:

- delete an array: split ("", array)
- a variable passed as a parameter to a function must be an array, like
  global[0], to be mutable:

    $ echo | mawk 'function a(){b=5} { b=2; a(); print b}'
    5
    $ echo | mawk 'function a(c){c=5} { b=2; a(b); print b}'
    2
    $ echo | mawk 'function a(c){c[0]=5} { b[0]=2; a(b); print b[0]}'
    5

### 3.5. Notes on portable syntax

When you want to know whether a tool is available, the most portable
way is "command -v", which should be a built-in:

    $ AWK=$( command -v mawk 2>/dev/null ); AWK="${AWK:-awk}"

Sometimes you have to access user local directories, which the FHS
(Filesystem Hierarchy Standard) doesn't cover. Two well established
paths are "~/.config" and "~/.local/share". There is also the X Desktop
Group approach of consulting ~/.config/user-dirs.dirs with
xdg-user-dir(1), with XDG_CONFIG_HOME and XDG_DATA_HOME corresponding
to the previously mentioned dirs. A graceful query of the local
configuration path may look like this, but it's most certainly just
"~/.config/":

    $ config="${XDG_CONFIG_HOME:-${HOME}/.config}/ACMETOOL/config"

Temporary files can be attempted under TMPDIR (which is POSIX), then
XDG_RUNTIME_DIR (which is a relatively recent addition), and then fall
back to /tmp.
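
A sketch of that fallback chain ("myscript" is a made-up name):

    tmpbase="${TMPDIR:-${XDG_RUNTIME_DIR:-/tmp}}"
    tmpfile=$( mktemp "${tmpbase}/myscript.XXXXXX" ) || exit 1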

Most common GNU/BSD discrepancies:

- sed extended regex switch: "sed -r" vs "sed -E"
- print file in reverse order: "tac" vs "tail -r"
- file modification date in seconds since epoch:
  "stat -c %Z" vs "stat -f %Uc"

On BSD, the GNU versions of the utils are prefixed with "g", like
"gawk", "gdate", "gsed", etc. BSD date differs from GNU date in some
details. Printing ISO8601 works the same for both date versions, while
parsing a date back into seconds since the epoch differs (GNU "-d" vs
BSD "-j -f"):

    $ date -u +'%Y-%m-%dT%H:%M:%SZ'
    $ date -u -d '2018-03-05T19:24:21Z' +%s
    $ date -j -u -f '%Y-%m-%dT%H:%M:%SZ' '2018-03-05T19:24:21Z' +%s

Last but not least, your  sort(1) results surprisingly may vary between
systems depending on LC_COLLATE setting.

### 3.6. Notes on UTF-8 compatibility

gawk works with UTF-8 by default. The awk language wasn't crafted with
multibyte encodings in mind, so there can be problems if you work with
binary data. By default sprintf("%c", 150) will print a multibyte char
like "0xC2 0x96", and string comparison will fail if you pass the
single byte 0x96 from a shell variable, e.g. 'printf "\226"' in dash.
gawk here requires the "-b" switch; mawk works well by default.

GNU sed is unicode aware. Other seds' support varies; without proper
multibyte support you won't get it working at all, because even a
single char "." matches the wrong length.

"wc -c" counts bytes by definition ("wc -m" is the one meant to count
characters), GNU tr also doesn't handle multibyte characters, so for
transliteration you have to resort to the sed "y" command with GNU sed.
You shouldn't rely on the shell "${#var}" expansion either, as it still
often reports the number of bytes and not characters.

### 3.7. Working with XML and HTML: curl, tidy and xmlstarlet

This is relevant to web page scraping. 

There are two URL encodings you may encounter: punycode and percent
encoding. Punycode (the "xn--" prefix) covers internationalized domains
and can be converted back and forth with idn(1).

Percent encoding operations (urldecode and urlencode):

    $ echo 'a%20b+c' | sed "s|+| |g;s|%|\\\\x|g" | xargs -L1 printf "%b"
    $ printf "%s" test | od -An -tx1 | tr ' ' '%' | xargs printf "%s"

Pages are usually grabbed with curl(1) or wget(1), with wget being more
suitable for mirroring/recursive downloads. curl invocations requesting
compressed data and setting the user agent:

    $ curl -H 'Accept-encoding: gzip' | gunzip -
    $ ua='Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36'
    $ ua="${ua} (KHTML, like Gecko) Chrome/40.0.2214.85 Safari/537.36"
    $ curl -A "${ua}"

Picky servers require the referer and user agent to be set. See also
the curl manpage for the "retry" set of options for responsible
downloads. If you fetch data repeatedly every few seconds, curl lets
you bound the delays with its timeout options, whereas no single option
ensures that wget finishes within a defined time.

Don't forget to strip CR characters with "tr -d '\r'".

When you're done fetching, it's time to convert your HTML source to
uniform XML. tidy(1) corrects and reformats HTML into well-formed XHTML
[21]:

    $ tidy -iqw0 -asxml input.html >output.xml

When you've finished scalpeling the sample input, you may validate and
pretty print it with xmllint(1), bundled with libxml(3):

    $ xmllint --format ugly.xml >pretty.xml

Now it's time to finally make queries. The xmlstarlet(1) "sel" command
provides a compact way of writing XSLT right on the command line:

    $ xmlstarlet sel --help
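
For instance, extracting every item title from an RSS feed might look
like this (a sketch; "feed.xml" is an assumed input file):

    $ xmlstarlet sel -t -m '//item' -v 'title' -n feed.xml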

The XML tree structure inspection command is handy for unfamiliar
input:

    $ xmlstarlet el -u

Finally, you may refer to elements in the default namespace (e.g. the
XHTML one produced by tidy) with the "_" prefix shortcut:

    $ xmlstarlet sel -t -v "//_:p[@content]"

One notable limitation of xmlstarlet is input file size. It doesn't
provide a SAX parser and will fail on files a few hundred MB in size.

Most important, W3 maintains a set of C tools dedicated to XML/HTML
work, like hxtoc for TOC creation, hxpipe for conversion to a YAML-like
structure more usable for awk(1) parsing, hxselect for querying with
CSS selectors, and others [22].

### 3.8. Working with JSON: jq

JSON (RFC4627) does what it has to - serializes data in a text
representation without much bloat, so you can read it without any kind
of decoder [23]. It lacks standard validation methods, but work has
been done in that direction [24].

The best tool to mutate and query JSON data is jq, which is already
popular enough.

I should mention that YAML with one entity per line also lacks the
bloat of XML and is just perfect for parsing with awk without extra
tools.

jq is also tolerant of JSON lines input (each line is a valid json
expression) [25], which is a compromise for munging it with grep/awk
and jq, and solves the disgusting problem of the root braces {}.
Besides, it's great for parallelization.

Just an example jq call to rename field "a" to "b":

    $ jq -c 'with_entries(if .key == "a"
        then {key: "b", value: .value} else . end)'

### 3.9. Working with CSV: miller

The most common predecessor of CSV data is some flavour of spreadsheet.
There is the ubiquitous python-based csvkit, but probably more correct
would be the Gnumeric supplementary converter [26]:
    $ ssconvert input.xls output.csv

Now that we have our data ready, let's process it. Without quoted
values, the usual tools are enough: awk with the "-F" flag, cut, sort
with the "-k" flag. A handy shortcut for skipping CSV fields while
searching:

    $ field="[^,${tabchar}]+[,${tabchar}]"
    $ egrep -m1 "^${field}${field}${search_string}[,${tabchar}]"

See "4.2 Internal field separator" chapter for more examples for CSV.

miller(1) [27] jumps in straight where jq got its place: it's easy to
drag a single binary around, and it combines the speed of awk with the
features of CSV tools from popular scripting languages. Recently
(around 2016?) it finally got double quoted RFC4180 compliant CSV
support.
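
A hedged example of the kind of invocation miller accepts (field names
are made up; consult the miller docs for the exact verbs):

    $ mlr --icsv --ojson cut -f name,age input.csv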

Depending on your language preferences, there are far too many CSV
processing tools. Two of them are csvkit and q (both in python). While
csvkit is the obvious one, q is a way to query CSV with SQL [28].


4. Advanced topics
------------------

The following problems are not related to average scripting needs. As
was stated before, think twice about the right choice of tool.

### 4.1. Reading user input

You may attach GNU readline capabilities to any input using rlwrap;
just note that it'll obviously store input history (consult the rlwrap
manpage). Here is a Debian specific example of running the SDF chat
program:

    $ rlwrap annotate-output ssh ulcer@sdf.org 'com 2>/dev/null'

For timing out line/char reading routines, the most portable solution
involves stty with dd. Otherwise you may get "line -t", coreutils
timeout(1) or bash "read -t" working.

### 4.2. Internal field separator

Which you usually shouldn't touch, because this path carries only
pitfalls.

IFS is composed of characters which act as field separators for
unquoted expansions. The first IFS char works for output (joining "$*")
too. IFS can be handy when you are certain about the proper composition
of your input. Any user provided input without proper quotation carries
the possibility of arbitrary code execution.

Let's see how it works:

    $ tab=$( printf '\011' )
    $ get_element() { eval printf \"\%s\" \$$(( $1 + 1 )) ; }
    $ get_csv() { local id IFS; id="$1"; shift; IFS=",${tab}"
        get_element $id $* ; }
    $ get_csv 3 'foo,\\$(date);date \$(date),bar'
    bar

    $ print_csv() { local IFS; IFS=,; printf "%s" "$*"; }
    $ print_csv a b c
    a,b,c

Such code doesn't outsource tasks to external calls, and especially not
to pipes of external calls, but it doesn't provide a great advantage
over e.g. printf piped to cut(1) either.

IFS is often used in read cycles:

    # read whole line
    while IFS= read -r line; do
        printf "%s\n" "${line}"
    done

    # read CSV fields with residue going into $c3
    debug() { printf "<%s>\n" "$@"; }
    while IFS=, read -r c1 c2 c3; do
        debug "$c1" "$c2" "$c3"
    done <csv_example

Preserve your IFS in some variable if you're going to write a script to
be inlined somewhere. Also note that there's a special set of separator
control characters in ASCII, good for a unique IFS: 1C-1F (see RFC20 or
try "man 7 ascii").

### 4.3. Command line options manipulation

You don't usually want to do this, though there are a few useful cases.

Restoring options for another round of getopts parsing requires
resetting the previously invoked getopts by setting OPTIND to 1. This
is also important if your script is going to be inlined.

    OPTIND=1
    while getopts ...

Example code to preserve and rearrange positional params (as seen with
the "prefix" leading parameter):

    cmd_prefix="prefix"

    escape_params() {
        for i in "$@"; do
            printf "%s" "$i" | sed -e "s/'/'\"'\"'/g" -e "s/.*/'&' /"
        done
    }

    params="${cmd_prefix} "$( escape_params "$@" )
    printf "DEBUG %s\n" "${params}"
    eval set -- ${params}
    printf "DEBUG arg: \"%s\"\n" "$@"

    shift 3

    # save params
    params=$( escape_params "$@" )

    # split string
    split_me="a:b:c"
    backifs="${IFS}"
    IFS=:
    set -- ${split_me}
    printf "TEST arg: \"%s\"\n" "$@"
    IFS="${backifs}"

    # restore params
    eval set -- ${params}
    printf "DEBUG arg: \"%s\"\n" "$@"

This code doesn't handle parameters with newlines.

### 4.4. Nested constructions difficulties and recursion

Pipes spawn subshells. This is a problem because you can't pass a
variable from the subshell back to the parent script, which causes
painful code rearrangements and workarounds.

One workaround for being unable to pass a variable upwards is to use a
temporary file as stdin for e.g. the while loop. The other way is
capturing stdout/stderr output into a variable. Make sure you know what
the "Useless Use of cat Award" is about [29].
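
A small illustration of the pitfall and the temporary file workaround:

    # pitfall: the loop runs in a subshell, count stays 0
    count=0
    printf '%s\n' a b c | while IFS= read -r line; do
        count=$(( count + 1 ))
    done
    echo "count=${count}"               # prints count=0

    # workaround: feed the loop from a temporary file instead of a pipe
    tmp=$( mktemp ) || exit 1
    printf '%s\n' a b c >"${tmp}"
    count=0
    while IFS= read -r line; do
        count=$(( count + 1 ))
    done <"${tmp}"
    rm -f "${tmp}"
    echo "count=${count}"               # prints count=3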

You should make a habit of writing conditional switches in the full
"if-then-else" form and not the compact "[ ] && ... || ..." form: if
the command after "&&" fails, the "||" branch runs as well, which is
rarely what was intended.

Recursion is useful in shell scripts, because without piped subshell
calls the scope of variables is not isolated. A safe assumption would
be something below 1000 calls (an artificial limit in mksh and zsh;
others simply crash). Test with:

    $ sh -c 'a() { echo $1; a $(( $1 + 1 )); }; a 1'

### 4.5. Libraries and trap cascading

You may source heavy chunks of code organized into libraries, which may
also perform some kind of initialization. One way of preventing
duplicate runs of such init code is dedicated library management with
core code like this:

    require_plugin() {
        if ! printf "%s" "${plugin_array}" | \
            grep -qw "$1" 2>/dev/null; then
            echo >&2 "Exporting: $1"
            plugin_array="${plugin_array} $1"
            . "$LIBPATH/$1"
        fi
    }

Your very first inlined script will raise the question of preserving
the parent's trap handlers. Here is a proposed solution:

    # $1 -- trap name (to prevent duplicates)
    # $2 -- trap command
    # $3 $* -- signals
    trap_add() {
        # stay silent on incorrect options and on fail to set trap
        [ -z "$3" ] && return
        printf "%s" "${trap_array}" | grep -qw "$1" 2>/dev/null && return 0
        trap_array="${trap_array} $1 "
        trap_cmd="$2"; shift 2
        for trap_signal in "$@"; do
            trap -- "$(
                extract_trap_cmd() { printf '%s\n' "$3"; }
                eval extract_trap_cmd $( trap | \
                    sed -n "H;/^trap --/h;/${trap_signal}$/{x;p;q;}" )
                printf '%s\n' "${trap_cmd}"
            );" "${trap_signal}" || return
        done
    }
    # debug
    trap_add test 'echo 123' INT TERM EXIT
    trap_add test2 'date' INT TERM EXIT
    trap

Never forget to test your code chunks in multiple shells: e.g. dash and
mksh do not implement the POSIX provision that a command substitution
consisting only of "trap" is treated as if it were not called from a
subshell and so dumps the existing signal handlers. The former example
therefore works only in bash/zsh and, to be portable, requires
preserving all trapped signal handlers in dedicated variables.

### 4.6. Debugging

Debugging shell scripts is done either with the "set -x" shell option
to trace executed commands (which also works well for performance
troubleshooting) or with debug printfs all around the code.

The usual debug routine is 'printf "debug \"%s\"\n" "$@"', which
expands to one param per line.

hexdump(1) may be missing on the target machine. Alternatives are
xxd(1), often distributed with vim(1), and od(1), with od being the
most ubiquitous.

Problem cases can be much more obscure. One common pitfall is a
forgotten ";" inside "$()"/"{}"/"()" constructs when the closing
bracket is on the same line as the last instruction.
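
A tiny illustration of the brace case:

    $ f() { echo hi }      # shell keeps waiting for the closing "}"
    $ f() { echo hi; }     # correct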

### 4.7. Testing and deploying in new environment

shtest [30] is a single file in POSIX shell without any dependencies.
It takes a different approach from unit tests in that it doesn't force
you to write anything. You copy the commands to be tested with some
prefix (by default "$", like any documentation does), run "shtest -r",
which records those commands' output and exit statuses, and finally run
the tests in a new environment (or under a different shell) to get diff
output if something went wrong. shtest tests are also easily embeddable
in markdown documentation.

shtest was inspired by the similar cram tool in python [31], except
that it doesn't need python to run.

For a more systematic approach see BATS [32], which can output
TAP-compatible messages suitable for continuous integration.

### 4.8. Networking with shell

This one is a completely esoteric part which carries little practical
value. With the already mentioned curl doing its job for the HTTP (and
GOPHER!) protocols, it's horribly inefficient but still possible to do
simple tasks with binary data flows under just shell and awk. You still
need a few specialized tools.

tcpick(8) encodes byte streams into hex [33]:

    /usr/bin/stdbuf -oL /usr/sbin/tcpick -i eth0 \
        "src $1 && port 5122 && \
            tcp[((tcp[12:1] & 0xf0) >> 2):2] = 0x9500" -yH | awk ...

Actually, you may do full-fledged networking with just netcat and sh
[34]. That one was a functional online game client: the netcat stream
was read word by word with dd, and expect-like behavior and binary
packet parsing were done in pure shell. It still serves as a good
example of scalpeling binary data with only the shell.

### 4.9. Paste safety

It's a survival guide after all, so let's step away from the target
subject and look at how you actually do scripting. Obviously, no one
reads man pages and code is written by copying parts from the web.

See this question [35] for details, and at the very least inspect your
paste with "xsel -o | hd".

5. Further reading
------------------

POSIX standard is your best friend [36].

You should obviously read the manpages. The dash and mawk manpages are
pretty excellent and compact for learning the corresponding topics and
for use as a language reference.

For GNU bloatware, important info is often contained within GNU info(1)
pages (which means the man pages may miss what you seek; take a look at
e.g. sed). This format should certainly die one day, but so far it is
accessible with info(1), or with the somewhat abandoned pinfo(1) for
colored output. And of course it is available at [37].

Then take a look at these resources:

- comp.unix.questions  / comp.unix.shell  archived FAQ,  which contains
  still relevant answers for particular scripting topics [38].
- harmful.cat-v.org [39]

And finally, a few individual widely circulated pages:

- Rich's POSIX sh tricks, covering advanced topics [40]
- Sculpting text with regex, grep, sed, awk [41]
- 7 command-line tools for data science [42]

The Bash FAQ, which still covers a lot of newbie POSIX-related
pitfalls: [43]

6. References
-------------

1. Introduction

[1] Ubuntu Wiki, "Dash as /bin/sh"
    https://wiki.ubuntu.com/DashAsBinSh
[2] Greg's Wiki, "How to make bash scripts work in dash"
    https://mywiki.wooledge.org/Bashism
[3] checkbashisms - check for bashisms in /bin/sh scripts
    https://manpages.debian.org/testing/devscripts/checkbashisms.1.en.html
[4] Tom Christiansen, "Csh Programming Considered Harmful", 1996-10-06
    http://harmful.cat-v.org/software/csh
[5] Bruce Barnett, "Top Ten Reasons not to use the C shell", 2009-06-28
    http://www.grymoire.com/unix/CshTop10.txt
[6] Tom Duff, "Rc — The Plan 9 Shell"
    http://doc.cat-v.org/plan_9/4th_edition/papers/rc
[7] Rob Pike, "Structural Regular Expressions", 1987
    http://doc.cat-v.org/bell_labs/structural_regexps/se.pdf

2. Scripting basics

[8] Stéphane Chazelas. "Why is printf better than echo?"
    https://unix.stackexchange.com/a/65819
[9] Why is $(...) preferred over `...` (backticks)?
    http://mywiki.wooledge.org/BashFAQ/082
[10] Fred Foo. "When do we need curly braces around shell variables?"
    https://stackoverflow.com/a/8748880
[11] Tobia. Make xargs execute the command once for each line of input
    https://stackoverflow.com/a/28806991
[12] Dennis Williamson. "Cross-platform getopt for a shell script"
    https://stackoverflow.com/a/2728625
[13] Graeme. "What defines the maximum size for a command single argument?"
    https://unix.stackexchange.com/a/120842
[14] William Pursell. "sed edit file in place"
    https://stackoverflow.com/a/12696585
[15] Wikipedia - sync (Unix)
    https://en.wikipedia.org/wiki/Sync_(Unix)

3. Commentary on tools

[16] Pádraig Brady, "stdio buffering", 2006-05-26
    http://www.pixelbeat.org/programming/stdio_buffering/
[17] Aaron Digulla. "Turn off buffering in pipe"
    https://unix.stackexchange.com/questions/25372/turn-off-buffering-in-pipe
[18] Martin Streicher, "Speaking UNIX: Peering into pipes", 2009-11-03
    https://www.ibm.com/developerworks/aix/library/au-spunix_pipeviewer/index.html
[19] Adam Drake, "Command-line Tools can be 235x Faster than your Hadoop Cluster", 2014-01-18
    https://aadrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html
[20] Brendan O'Connor, "Don’t MAWK AWK – the fastest and most elegant big data munging language!", 2012-10-25
    https://brenocon.com/blog/2009/09/dont-mawk-awk-the-fastest-and-most-elegant-big-data-munging-language/
[21] The granddaddy of HTML tools, with support for modern standards
    https://github.com/htacg/tidy-html5
[22] HTML and XML manipulation utilities
    https://www.w3.org/Tools/HTML-XML-utils/README
[23] Douglas Crockford, "JSON: The Fat-Free Alternative to XML", 2006-12-06
    http://www.json.org/fatfree.html
[24] JSON Schema is a vocabulary that allows you to annotate and validate JSON documents
    http://json-schema.org/
[25] JSON Lines
    http://jsonlines.org/
[26] The Gnumeric Manual, "Converting Files"
    https://help.gnome.org/users/gnumeric/stable/sect-files-ssconvert.html.en
[27] Miller is like awk, sed, cut, join, and sort for CSV
    https://johnkerl.org/miller/doc/index.html
[28] Run SQL directly on CSV files
    http://harelba.github.io/q/

4. Advanced topics

[29] Useless Use of Cat Award
    http://porkmail.org/era/unix/award.html#uucaletter
[30] shtest - run command line tests
    https://github.com/uuuuu/shtest
[31] Cram is a functional testing framework based on Mercurial's unified test format
    https://bitheap.org/cram/
[32] Bats: Bash Automated Testing System
    https://github.com/sstephenson/bats
[33] tcpick with awk example
    https://github.com/uuuuu/tmww/blob/master/utils/accsniffer
[34] shamana - tmwa ghetto bot engine made with POSIX shell 
    https://github.com/uuuuu/shamana
[35] Sam Hocevar. "How can I protect myself from this kind of clipboard abuse?"
    https://security.stackexchange.com/questions/39118/how-can-i-protect-myself-from-this-kind-of-clipboard-abuse

5. Further reading

[36] The Open Group Base Specifications Issue 7, 2016 Edition
    http://pubs.opengroup.org/onlinepubs/9699919799/
[37] GNU Coreutils
    https://www.gnu.org/software/coreutils/manual/html_node/index.html
[38] Unix - Frequently Asked Questions
    http://www.faqs.org/faqs/unix-faq/faq/
[39] Encyclopedia of things considered harmful
    http://harmful.cat-v.org/
[40] Rich’s sh (POSIX shell) tricks
    http://www.etalabs.net/sh_tricks.html
[41] Matt Might: Sculpting text with regex, grep, sed, awk
    http://matt.might.net/articles/sculpting-text/
[42] Jeroen Janssens, "7 command-line tools for data science", 2013-09-19
    http://jeroenjanssens.com/2013/09/19/seven-command-line-tools-for-data-science.html
[43] Greg's Wiki, "Bash Pitfalls"
    http://mywiki.wooledge.org/BashPitfalls

7. Changelog
------------

2018-03-08 initial release
2018-03-29 ADD missed to mention W3 hx* tools from html-xml-utils
    ADD few newbie awk hints
2018-04-11 ADD portable urlencode/urldecode, fast querying with "join"
2018-05-08 FIX example in 3.4 "Notes on awk" about mutable variable passed as parameter