2-Feb-2021 | Like this? Dislike this? Let me know |
jq is the de facto standard for command-line / shell-script utilities
dealing with JSON. In the new JSON data ecosystem, it is as prevalent,
important, and useful as sed and awk. Thanks to a compact expression
syntax and a good approach to dealing with arrays, jq makes it easy to
extract values from complex JSON structures.
But is it really the best multipurpose tool for the job, especially as the data
manipulation requirements become increasingly complex?
As your process control, data interrogation, and command input/output needs
grow, running it all in Python becomes clearly more attractive.
Consider this JSON:
{
  "sector": "ABC",
  "items": [
    {"name": "corn",  "id": 1, "hist": [ {"d":"2020-01-01","v":100}, {"d":"2020-01-02","v":200} ]},
    {"name": "wheat", "id": 2, "hist": [ {"d":"2020-01-03","v":300}, {"d":"2020-01-04","v":400} ]},
    {"name": "rice",  "id": 3, "hist": [ {"d":"2021-01-03","v":500}, {"d":"2021-01-04","v":600} ]}
  ]
}
cat thatJson | jq -r '.items[] | select(.name != "corn") | "\(.name) was \(.hist[-1].v)"'
wheat was 400
rice was 600
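For comparison (and as a preview of where this article is headed), here is a sketch of the same extraction in Python using nothing but the standard json module; the one-liner advantage clearly belongs to jq here:

```python
import json

# The same document as above, decoded into plain dicts and lists:
doc = json.loads("""
{ "sector":"ABC", "items": [
  {"name":"corn",  "id":1, "hist": [ {"d":"2020-01-01","v":100}, {"d":"2020-01-02","v":200} ]},
  {"name":"wheat", "id":2, "hist": [ {"d":"2020-01-03","v":300}, {"d":"2020-01-04","v":400} ]},
  {"name":"rice",  "id":3, "hist": [ {"d":"2021-01-03","v":500}, {"d":"2021-01-04","v":600} ]} ] }
""")

# .items[] | select(.name != "corn") | "\(.name) was \(.hist[-1].v)"
for item in doc['items']:
    if item['name'] != 'corn':
        print(f"{item['name']} was {item['hist'][-1]['v']}")
```

It prints the same `wheat was 400` / `rice was 600` lines; the payoff of the dict/list form only shows up later, when the logic around the extraction grows.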
Many cloud provider CLIs return complex JSON shapes; jq is a superb way to work with this content. For example, launching an AWS VM returns a complex structure -- but your VM is not ready yet. You must poll to find out when it is actually running. This is easily done in a shell script:
aws ec2 run-instances \
    --count 1 \
    --instance-type myType \
    --security-group-ids quicklaunch-1 \
    --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=Hello}]' > zzz
IID=$(jq -r '.Instances[].InstanceId' zzz)
echo launched ID $IID

# The simple way to do this is just use the 'wait' command:
#    aws ec2 wait instance-running --instance-ids $IID
# But we show the polling solution below because this gives the opportunity
# to do something while waiting (like printing status or dots for each loop, etc.)
while true
do
    STATE=$(aws ec2 describe-instances --instance-ids $IID | jq -r '.Reservations[].Instances[].State.Name')
    if [ "$STATE" == "running" ]; then
        break
    fi
    sleep 10
done

# A bit of inefficiency calling it again (the PublicIpAddress field was in fact present
# in the payload where the State.Name was "running") but it is harder to deal with
# multiple value assignments in the shell
IP=$(aws ec2 describe-instances --instance-ids $IID | jq -r '.Reservations[].Instances[].PublicIpAddress')
echo "IP: $IP"
Here is a one-liner, complete with VT100 color output, that reports on AWS EC2 instances. Be careful to distinguish the shell pipes from the jq pipes! Too bad jq doesn't have printf formatting directly, as we could then eliminate awk. Since we are using awk anyway, arguably it would be better to put the color-setting conditional logic there instead, but we show it in jq just to show off a bit.
aws ec2 describe-instances \
  | jq -r '.Reservations[].Instances[]
      | if .State.Name == "running" then .COLOR="\u001B[32m"
        elif .State.Name == "stopped" then .COLOR="\u001B[31m"
        else .COLOR="\u001B[34m" end
      | "\(.COLOR)\(.State.Name)\u001b[30m \(.InstanceId) \(.Tags[] | select(.Key == "Name") | .Value) \(.InstanceType) \(.LaunchTime) \(.PublicIpAddress)"'

And here it is with some additional empty-Tags protection (shout out to David-Z), color assignment and a little more output control moved into awk, and some escaped CRs for clarity. Note that since the instance Name in particular might have spaces, we use tilde as a delimiter (there are too many bars already, and colon and comma may pop up in the Name and/or IP as well).

aws ec2 describe-instances \
  | jq -r '.Reservations[].Instances[]
      | "\(.State.Name)~\(.InstanceId)~\(first(.Tags[] | select(.Key == "Name").Value)? // "(none)")~\(.InstanceType)~\(.LaunchTime)~\(.PublicIpAddress)"' \
  | awk -F '~' '{
      ip="";
      if($1=="running"){color=32;ip=$6}
      else if($1=="stopped"){color=31}
      else {color=34};
      printf "\033[%dm%4.4s\033[30m %-20.20s %16.16s %12.12s %s %s\n", color, $1, $2, $3, $4, $5, ip;
    }'
The key defensive expression here handles instances with no Tags array at all, as well as tag lists lacking a Name key:

first(.Tags[] | select(.Key == "Name").Value)? // "(none)"
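The Python analogue of that defensive lookup is a next() with a default. A minimal sketch, assuming inst is one decoded Instances[] entry (the sample entry below is hypothetical):

```python
# Hypothetical decoded Instances[] entry; note Tags is missing entirely:
inst = {"InstanceId": "i-0abc", "State": {"Name": "running"}}

# first(.Tags[] | select(.Key == "Name").Value)? // "(none)" in Python:
# inst.get('Tags', []) tolerates a missing Tags array, and next()'s second
# argument supplies the default when no Name tag is found.
name = next((t['Value'] for t in inst.get('Tags', []) if t['Key'] == 'Name'),
            '(none)')
print(name)   # → (none)
```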
Shell scripts, heredocs, backgrounding, and jq team up to make a powerful, compact, and performant ensemble:
for name in A B C D E F
do
    TF=/tmp/$name.$$.cmd
    cat <<EOS > $TF
    aws ec2 run-instances ... (various commands here)
EOS
    bash $TF > /tmp/$name.response.json &   # Background!
done

# Now wait for all those parallel executions to complete.
# This is a very powerful and useful idiom: Easily launch a bunch of things
# in the background with '&' and then wait for them all to finish:
wait

# When control returns here, A-F.response.json will have all JSON outputs which
# may be accessed via jq
At some point you need to start capturing and working with return codes, stdout, and stderr from these tasks. You'll also want the ability to easily examine the entire JSON data structure -- and potentially modify it -- and save it without rereading it over and over. Recall the "inefficiency" above. It is a lot less elegant when the shell has to deal with more than one piece of data coming out of jq:
while true
do
    read -r name ip <<< $(aws ec2 describe-instances --instance-ids $IID \
        | jq -r '.Reservations[].Instances[] | "\(.State.Name) \(.PublicIpAddress)"')
    ...
In general, commands executed in a shell script need to do one of the following to capture the output streams and the return code:

command args 1>theStdout.txt 2>theStderr.txt ; returncode=$?

or

MYVAR=$(command args 2>&1) ; RC=$?
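This bookkeeping is where Python's subprocess module already has the shell beat: one run() call hands back the return code, stdout, and stderr together on a single object, with no redirection files or $? juggling. A small sketch, using ls on a nonexistent path as a stand-in command:

```python
import subprocess

# returncode, stdout, and stderr all arrive on one CompletedProcess object.
p = subprocess.run(['ls', '/no/such/dir'], capture_output=True, text=True)
print(p.returncode)     # non-zero, since the path does not exist
print(repr(p.stdout))   # empty string
print(repr(p.stderr))   # ls's error message, captured separately
```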
A second point of irritation with shell scripts is arguments and quoting. Simple string and integer arguments work fine, but consider trying to pass this to a command:
command --opt1 val --opt2 "val2 val3" --opt3 " \"val4\" " \
    --opt4 ' "noInterpInsideSingleQuotes" ' ...
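In Python the problem largely disappears: arguments are list elements handed to the command verbatim, with no intervening shell interpretation layer, so embedded spaces and quote characters need no escaping at all. A sketch, using printf as a stand-in for the command:

```python
import subprocess

# Each list element reaches the command exactly as written; no quoting
# or backslash-escaping gymnastics required.
args = ['val', 'val2 val3', ' "val4" ', ' "noInterpInsideSingleQuotes" ']
p = subprocess.run(['printf', '%s\n'] + args, capture_output=True, text=True)
print(p.stdout)   # each argument emerges intact, one per line
```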
Lastly, complex workflows touching different parts of the JSON data lead to lots of individual jq executions, each reading JSON input, modifying it, and writing it back out to a tmp file to protect against clobbering the file if a failure occurs:
QQ=$(jq -r '.aaa.bbb' $FILE)
if [ condition ] ; then
    jq -r '.this.that | . + {"foo":"bar"}' $FILE > $FILE.tmp && mv $FILE.tmp $FILE
else
    jq -r '.other | . + {"code":401}' $FILE > $FILE.tmp && mv $FILE.tmp $FILE
fi
jq -r '.status = "COMPLETE"' $FILE > $FILE.tmp && mv $FILE.tmp $FILE
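In Python the equivalent workflow collapses to: load the JSON once, mutate the dict in memory as many times as needed, and write once at the end, with os.replace providing the same tmp-and-rename protection against clobbering. A sketch under assumed file contents and condition (both hypothetical):

```python
import json
import os
import tempfile

# Hypothetical working file; created here so the sketch is self-contained.
FILE = os.path.join(tempfile.mkdtemp(), 'state.json')
with open(FILE, 'w') as f:
    json.dump({"aaa": {"bbb": 7}, "this": {"that": {}}, "status": "NEW"}, f)

# Read the JSON exactly once...
with open(FILE) as f:
    data = json.load(f)

qq = data['aaa']['bbb']                    # QQ=$(jq -r '.aaa.bbb' $FILE)
if qq > 5:                                 # stand-in for [ condition ]
    data['this']['that']['foo'] = 'bar'
else:
    data['code'] = 401
data['status'] = 'COMPLETE'

# ...and write it once, via tmp-and-rename so a failure cannot clobber the file:
tmp = FILE + '.tmp'
with open(tmp, 'w') as f:
    json.dump(data, f)
os.replace(tmp, FILE)
```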
python3 brings a couple of big assets to the table: the subprocess module for process control, and the json module, which turns JSON into plain dicts and lists you can examine and modify in place.
Running a synchronous command from python is easy:
import subprocess

p1 = subprocess.run(['ls', '-l'], capture_output=True)
if 0 == p1.returncode:
    print(p1.stdout)
else:
    print("ERROR: ", p1.stderr)

which prints something like:

total 120
-rwxr-xr-x  1 user  staff    57 Feb  1 15:29 args.sh
-rwxr-xr-x  1 user  staff    37 Feb  1 15:27 args.sh~
-rw-r ...
Let's compare the aws example:
import subprocess
import json
import time

cmd = ['aws', 'ec2', 'run-instances',
       '--count', '1',
       '--instance-type', 'myType',
       '--tag-specifications', 'ResourceType=instance,Tags=[{Key=Name,Value=Hello}]'
       ]

p1 = subprocess.run(cmd, capture_output=True)
if 0 == p1.returncode:
    data = json.loads(p1.stdout)
    iid = data['Instances'][0]['InstanceId']

    cmd2 = ['aws', 'ec2', 'describe-instances', '--instance-ids', iid]
    while True:
        p2 = subprocess.run(cmd2, capture_output=True)
        if 0 == p2.returncode:
            rr = json.loads(p2.stdout)
            inst = rr['Reservations'][0]['Instances'][0]
            if "running" == inst['State']['Name']:
                ip = inst['PublicIpAddress']
                break
        time.sleep(10)   # poll every 10 seconds, as in the shell version
It is also possible to run commands "in the background" by using the lower-level Popen command. With a little bit of extra work we can create a background group object upon which a "wait" can be emulated, as follows:
import subprocess

class BG:
    def __init__(self):
        self.items = []

    def launch(self, id, args):
        oo = {"id": id}
        oo['p'] = subprocess.Popen([str(x) for x in args],
                                   stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        self.items.append(oo)

    def wait(self):
        for oo in self.items:
            (oo['stdout'], oo['stderr']) = oo['p'].communicate()
            oo['rc'] = oo['p'].returncode

    def results(self):
        return self.items

bg = BG()
for n in range(0, 3):
    bg.launch(n, ['aws', 'ec2', 'run-instances', ... ])

# Three run-instances launched in background; wait for them:
bg.wait()

# This is the useful part: the bg results easily capture returncode, stdout, and stderr:
for oo in bg.results():
    print(oo['id'], oo['rc'])
    print('STDOUT: ', oo['stdout'])
    print('STDERR: ', oo['stderr'])