2-Feb-2021
jq
(available here)
is the de facto standard for command-line / shell-script work with JSON. In the new JSON data ecosystem it is as prevalent, important, and useful as sed and awk. Thanks to a compact expression syntax and a good approach to dealing with arrays, jq makes it easy to extract values from complex JSON structures.
But is it really the best multipurpose tool for the job, especially as the data manipulation requirements become increasingly complex? As your process control, data interrogation, and command input/output needs grow, running it all in python becomes clearly more attractive.
Consider this JSON:
{
  "sector": "ABC",
  "items": [
    {"name": "corn", "id": 1, "hist": [
      {"d": "2020-01-01", "v": 100},
      {"d": "2020-01-02", "v": 200}
    ]},
    {"name": "wheat", "id": 2, "hist": [
      {"d": "2020-01-03", "v": 300},
      {"d": "2020-01-04", "v": 400}
    ]},
    {"name": "rice", "id": 3, "hist": [
      {"d": "2021-01-03", "v": 500},
      {"d": "2021-01-04", "v": 600}
    ]}
  ]
}
cat thatJson | jq -r '.items[] | select(.name != "corn") | "\(.name) was \(.hist[-1].v)"'
wheat was 400
rice was 600
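For comparison -- and as a preview of where we are headed -- here is the same extraction in python. This is a minimal sketch assuming the JSON above has been saved to a file named thatJson:

import json

# Load the whole document into native dicts and lists:
with open('thatJson') as f:
    doc = json.load(f)

# Equivalent of: .items[] | select(.name != "corn") | "\(.name) was \(.hist[-1].v)"
for item in doc['items']:
    if item['name'] != 'corn':
        print(item['name'], 'was', item['hist'][-1]['v'])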
Many cloud provider CLIs return complex JSON shapes; jq is a superb way to work with this content. For example, launching an AWS VM returns a complex structure -- but your VM is not ready yet. You must poll to find out when it is actually running. This is easily done in a shell script:
aws ec2 run-instances \
    --count 1 \
    --instance-type myType \
    --security-group-ids quicklaunch-1 \
    --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=Hello}]' > zzz
    # ^ quoted, to stop the shell from brace-expanding {Key=Name,Value=Hello}
IID=$(jq -r '.Instances[].InstanceId' zzz)
echo launched ID $IID
# The simple way to do this is to just use the 'wait' command:
#    aws ec2 wait instance-running --instance-ids $IID
# But we show the polling solution below because it gives us the opportunity
# to do something while waiting (like printing status or dots on each loop, etc.)
while true
do
    STATE=$(aws ec2 describe-instances --instance-ids $IID | jq -r '.Reservations[].Instances[].State.Name')
    if [ "$STATE" == "running" ]; then
        break
    fi
    sleep 10
done
# A bit of inefficiency here: we call describe-instances again even though the
# PublicIpAddress field was in fact present in the payload where State.Name
# was "running". But capturing more than one value from jq in the shell is
# clumsy, as we will see below.
IP=$(aws ec2 describe-instances --instance-ids $IID | jq -r '.Reservations[].Instances[].PublicIpAddress')
echo "IP: $IP"
Here is a one-liner, complete with VT100 color output, that reports on AWS EC2 instances. Be careful to distinguish the shell pipes from the jq pipes! Too bad jq has no printf-style formatting, as that would let us eliminate awk. And since we are using awk anyway, arguably the color-setting conditional logic belongs there instead -- but we show it in jq just to show off a bit.
aws ec2 describe-instances \
  | jq -r '.Reservations[].Instances[]
      | if .State.Name == "running" then .COLOR="\u001b[32m"
        elif .State.Name == "stopped" then .COLOR="\u001b[31m"
        else .COLOR="\u001b[34m" end
      | "\(.COLOR)\(.State.Name)\u001b[30m \(.InstanceId) \(.Tags[] | select(.Key == "Name") | .Value) \(.InstanceType) \(.LaunchTime) \(.PublicIpAddress)"'
And here is a version with some additional empty-Tags protection (shout out to David-Z), with the color assignment and a little more output control moved into awk. Note that since the instance Name in particular might contain spaces, we use tilde as the field delimiter (too many bars already, and colon and comma may pop up in Name and/or IP as well). Also note the jq format string is kept on a single line: a literal newline inside it would end up in the output and break awk's line-based parsing.
aws ec2 describe-instances \
  | jq -r '.Reservations[].Instances[]
      | "\(.State.Name)~\(.InstanceId)~\(first(.Tags[] | select(.Key == "Name").Value)? // "(none)")~\(.InstanceType)~\(.LaunchTime)~\(.PublicIpAddress)"' \
  | awk -F '~' '{ ip=""; if($1=="running"){color=32;ip=$6} else
        if($1=="stopped"){color=31} else {color=34};
        printf "\033[%dm%4.4s\033[30m %-20.20s %16.16s %12.12s %s %s\n",
            color, $1, $2, $3, $4, $5, ip; }'
The protective piece is this expression, which yields "(none)" whether Tags is missing entirely or merely lacks a Name key:
first(.Tags[] | select(.Key == "Name").Value)? // "(none)"
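The equivalent guarded lookup in python is a next() with a default -- a minimal sketch, where inst stands for one element of Reservations[].Instances[]:

inst = {"InstanceId": "i-0abc"}   # hypothetical record with no Tags at all

# Mirrors first(.Tags[] | select(.Key == "Name").Value)? // "(none)"
name = next((t['Value'] for t in inst.get('Tags', []) if t['Key'] == 'Name'),
            '(none)')
print(name)   # prints: (none)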
Shell scripts, heredocs, backgrounding, and jq team up to make a powerful, compact, and performant ensemble:
for name in A B C D E F
do
    TF=/tmp/$name.$$.cmd
    cat <<EOS > $TF
aws ec2 run-instances ...
(various commands here)
EOS
    bash $TF > /tmp/$name.response.json &   # Background!
done

# Now wait for all those parallel executions to complete.
# This is a very powerful and useful idiom: easily launch a bunch of things
# in the background with '&' and then wait for them all to finish:
wait

# When control returns here, /tmp/A.response.json through /tmp/F.response.json
# will hold all the JSON outputs, which may be accessed via jq.
At some point you need to start capturing and working with the return codes, stdout, and stderr of these tasks. You will also want to examine the entire JSON data structure easily -- and potentially modify it -- and save it without rereading it over and over. Recall the "inefficiency" above: it gets a lot less elegant when the shell has to deal with more than one piece of data coming out of jq:
while true
do
    read -r name ip <<<$(aws ec2 describe-instances --instance-ids $IID \
        | jq -r '.Reservations[].Instances[] | "\(.State.Name) \(.PublicIpAddress)"')
    ...
In general, commands executed from a shell script end up needing some variation of this:
command args 1>theStdout.txt 2>theStderr.txt ; returncode=$?
MYVAR=$(command args 2>&1) ; RC=$?
A second point of irritation with shell scripts is arguments and quoting. Simple string and integer arguments work fine, but consider trying to pass this to a command:
command --opt1 val --opt2 "val2 val3" --opt3 " \"val4\" " \
--opt4 ' "noInterpInsideSingleQuotes" ' ...
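In python this problem vanishes, because arguments are passed as a list and are never re-parsed by a shell. A minimal sketch (the command and option names are the hypothetical ones from above):

import subprocess

# Each argument is exactly one list element; with no shell in the middle there
# is no quoting, globbing, or interpolation to worry about.
cmd = ['command', '--opt1', 'val',
       '--opt2', 'val2 val3',
       '--opt3', ' "val4" ',
       '--opt4', ' "noInterpInsideSingleQuotes" ']
p = subprocess.run(cmd, capture_output=True)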
Lastly, complex workflows that touch different parts of the JSON data lead to lots of individual jq executions, each reading the JSON input, modifying it, and writing it back out through a tmp file (to protect against clobbering the file if a failure occurs):
QQ=$(jq -r '.aaa.bbb' $FILE)
if [ condition ] ; then
    jq -r '.this.that | . + {"foo":"bar"}' $FILE > $FILE.tmp && mv $FILE.tmp $FILE
else
    jq -r '.other | . + {"code":401}' $FILE > $FILE.tmp && mv $FILE.tmp $FILE
fi
jq -r '.status = "COMPLETE"' $FILE > $FILE.tmp && mv $FILE.tmp $FILE
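In python the same workflow reads the file once, operates on the structure in memory, and writes it once at the end. A sketch, where the file name and the condition stand in for $FILE and [ condition ] above:

import json
import os

fname = 'theFile.json'   # hypothetical; stands in for $FILE
condition = True         # stands in for the shell [ condition ] test

with open(fname) as f:
    doc = json.load(f)

qq = doc['aaa']['bbb']

if condition:
    doc['this']['that']['foo'] = 'bar'   # like .this.that | . + {"foo":"bar"}
else:
    doc['other']['code'] = 401           # like .other | . + {"code":401}

doc['status'] = 'COMPLETE'

# One write at the end, still via tmp file + rename for clobber protection:
tmp = fname + '.tmp'
with open(tmp, 'w') as f:
    json.dump(doc, f, indent=2)
os.replace(tmp, fname)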
python3 brings a couple of big assets to the table: the subprocess module for running commands and capturing their return codes and output, and the json module, which turns JSON into native lists and dicts that can be read once, inspected, modified, and written back out.
Running a synchronous command from python is easy:
import subprocess

p1 = subprocess.run(['ls', '-l'], capture_output=True, text=True)  # text=True decodes bytes to str
if 0 == p1.returncode:
    print(p1.stdout)
else:
    print("ERROR: ", p1.stderr)

This prints something like:
total 120
-rwxr-xr-x 1 user staff 57 Feb 1 15:29 args.sh
-rwxr-xr-x 1 user staff 37 Feb 1 15:27 args.sh~
-rw-r ...
Now let's revisit the aws example:
import subprocess
import json
import time

cmd = ['aws', 'ec2', 'run-instances',
       '--count', '1',
       '--instance-type', 'myType',
       '--tag-specifications', 'ResourceType=instance,Tags=[{Key=Name,Value=Hello}]']
p1 = subprocess.run(cmd, capture_output=True)
if 0 == p1.returncode:
    data = json.loads(p1.stdout)
    iid = data['Instances'][0]['InstanceId']
    cmd2 = ['aws', 'ec2', 'describe-instances', '--instance-ids', iid]
    while True:
        p2 = subprocess.run(cmd2, capture_output=True)
        if 0 == p2.returncode:
            rr = json.loads(p2.stdout)
            inst = rr['Reservations'][0]['Instances'][0]
            if "running" == inst['State']['Name']:
                ip = inst['PublicIpAddress']
                break
        time.sleep(10)   # same 10-second poll as the shell version
It is also possible to run commands "in the background" by using the lower-level Popen interface. With a little extra work we can create a background group object upon which a "wait" can be emulated, as follows:
import subprocess

class BG:
    def __init__(self):
        self.items = []

    def launch(self, id, args):
        oo = {"id": id}
        oo['p'] = subprocess.Popen([str(x) for x in args],
                                   stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        self.items.append(oo)

    def wait(self):
        for oo in self.items:
            (oo['stdout'], oo['stderr']) = oo['p'].communicate()
            oo['rc'] = oo['p'].returncode

    def results(self):
        return self.items
bg = BG()
for n in range(0, 3):
    bg.launch(n, ['aws', 'ec2', 'run-instances', ... ])

# Three run-instances launched in background; wait for them:
bg.wait()

# This is the useful part: the bg results easily capture returncode, stdout, and stderr:
for oo in bg.results():
    print(oo['id'], oo['rc'])
    print('STDOUT: ', oo['stdout'])
    print('STDERR: ', oo['stderr'])
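For what it is worth, the standard library can also do the grouping for us: concurrent.futures provides the same launch-then-wait shape without a hand-rolled class. A sketch (the run-instances arguments are elided just as above):

import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_one(id, args):
    # subprocess.run blocks, but each call runs in its own worker thread
    p = subprocess.run([str(x) for x in args], capture_output=True)
    return (id, p.returncode, p.stdout, p.stderr)

with ThreadPoolExecutor() as pool:
    # run-instances arguments elided here, as above
    futs = [pool.submit(run_one, n, ['aws', 'ec2', 'run-instances'])
            for n in range(0, 3)]
    for f in futs:   # leaving the 'with' block also waits, like wait()
        (id, rc, out, err) = f.result()
        print(id, rc)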