webdump_tests

Testfiles for webdump
git clone git://git.codemadness.org/webdump_tests

cm_json2tsv.html (8696B)


      1 <!DOCTYPE html>
      2 <html dir="ltr" lang="en">
      3 <head>
      4 	<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
      5 	<meta http-equiv="Content-Language" content="en" />
      6 	<meta name="viewport" content="width=device-width" />
      7 	<meta name="keywords" content="json2tsv, JSON, tsv, TAB-Separated Value Format" />
      8 	<meta name="description" content="json2tsv: a JSON to TAB-Separated Value converter" />
      9 	<meta name="author" content="Hiltjo" />
     10 	<meta name="generator" content="Static content generated using saait: https://codemadness.org/saait.html" />
     11 	<title>json2tsv: a JSON to TSV converter - Codemadness</title>
     12 	<link rel="stylesheet" href="style.css" type="text/css" media="screen" />
     13 	<link rel="stylesheet" href="print.css" type="text/css" media="print" />
     14 	<link rel="alternate" href="atom.xml" type="application/atom+xml" title="Codemadness Atom Feed" />
     15 	<link rel="alternate" href="atom_content.xml" type="application/atom+xml" title="Codemadness Atom Feed with content" />
     16 	<link rel="icon" href="/favicon.png" type="image/png" />
     17 </head>
     18 <body>
     19 	<nav id="menuwrap">
     20 		<table id="menu" width="100%" border="0">
     21 		<tr>
     22 			<td id="links" align="left">
     23 				<a href="index.html">Blog</a> |
     24 				<a href="/git/" title="Git repository with some of my projects">Git</a> |
     25 				<a href="/releases/">Releases</a> |
     26 				<a href="gopher://codemadness.org">Gopherhole</a>
     27 			</td>
     28 			<td id="links-contact" align="right">
     29 				<span class="hidden"> | </span>
     30 				<a href="/donate/">Donate</a> |
     31 				<a href="feeds.html">Feeds</a> |
     32 				<a href="pgp.asc">PGP</a> |
     33 				<a href="mailto:hiltjo@AT@codemadness.DOT.org">Mail</a>
     34 			</td>
     35 		</tr>
     36 		</table>
     37 	</nav>
     38 	<hr class="hidden" />
     39 	<main id="mainwrap">
     40 		<div id="main">
     41 			<article>
     42 <header>
     43 	<h1>json2tsv: a JSON to TSV converter</h1>
     44 	<p>
     45 	<strong>Last modification on </strong> <time>2021-09-25</time>
     46 	</p>
     47 </header>
     48 
     49 <p>Convert JSON to TSV or separated output.</p>
     50 <p>json2tsv reads JSON data from stdin.  By default it outputs each JSON node
     51 on its own line in a TAB-Separated Value format.</p>
     52 <h2>TAB-Separated Value format</h2>
     53 <p>The output format per line is:</p>
     54 <pre><code>nodename&lt;TAB&gt;type&lt;TAB&gt;value&lt;LF&gt;
     55 </code></pre>
     56 <p>Control-characters such as a newline, TAB and backslash (\n, \t and \\) are
     57 escaped in the nodename and value fields.  Other control-characters are
     58 removed.</p>
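<p>A consumer that needs the original text back can reverse the escaping. The following is a minimal sketch and not part of json2tsv: it assumes a backslash is written as two backslashes, and it relies on other control-characters being removed by default, so the byte \001 is free to use as a temporary marker. The function name unescape_values is made up for illustration.</p>

```shell
#!/bin/sh
# Hypothetical helper: print the value field (3rd column) with the
# \n, \t and \\ escapes turned back into the characters they stand for.
unescape_values() {
	awk -F '\t' '
	function unescape(s) {
		gsub(/\\\\/, "\001", s); # park escaped backslashes first
		gsub(/\\n/, "\n", s);    # escaped newline
		gsub(/\\t/, "\t", s);    # escaped TAB
		gsub("\001", "\\", s);   # restore one literal backslash each
		return s;
	}
	{ print unescape($3); }'
}

# Demo on a hand-written record; the value holds an escaped TAB.
printf '.title\ts\tfoo\\tbar\n' | unescape_values
```

<p>Escaped backslashes are parked first so that a sequence such as an escaped backslash followed by the letter n is not mistaken for an escaped newline.</p>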
     59 <p>The type field is a single byte and can be:</p>
     60 <ul>
     61 <li>a for array</li>
     62 <li>b for bool</li>
     63 <li>n for number</li>
     64 <li>o for object</li>
     65 <li>s for string</li>
     66 <li>? for null</li>
     67 </ul>
     68 <p>Filtering on the first field, "nodename", is easy using awk, for example.</p>
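<p>As a small illustration of such filtering, the sketch below runs awk over a hand-written sample in the nodename, type, value layout. The sample records and the function names sample, tag_values and string_nodes are invented for this example; they are not output captured from json2tsv.</p>

```shell
#!/bin/sh
# Hypothetical sample records: nodename, type and value separated by TABs.
sample() {
	printf '%s\t%s\t%s\n' \
		'.name' s 'json2tsv' \
		'.tags' a '' \
		'.tags[]' s 'json' \
		'.tags[]' s 'tsv' \
		'.stars' n '42'
}

# Print the value of every record whose nodename is exactly ".tags[]".
tag_values() {
	awk -F '\t' '$1 == ".tags[]" { print $3 }'
}

# Print the nodename of every record whose type field is "s" (string).
string_nodes() {
	awk -F '\t' '$2 == "s" { print $1 }'
}

sample | tag_values
```

<p>Comparing the first field with == matches the nodename exactly; a regex match with ~ can be used instead when a whole subtree should be selected.</p>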
     69 <h2>Features</h2>
     70 <ul>
     71 <li>Accepts all <strong>valid</strong> JSON.</li>
     72 <li>Designed to work well with existing UNIX programs like awk and grep.</li>
     73 <li>Straightforward and not many lines of code: about 475 lines of C.</li>
     74 <li>Few dependencies: C compiler (C99), libc.</li>
     75 <li>No need to learn a new (meta-)language for processing data.</li>
     76 <li>The parser supports code point decoding and UTF-16 surrogates to UTF-8.</li>
     77 <li>It does not output control-characters to the terminal for security reasons by
     78 default (but it has a -r option if needed).</li>
     79 <li>On OpenBSD it supports <a href="https://man.openbsd.org/pledge">pledge(2)</a> for syscall restriction:
     80 pledge("stdio", NULL).</li>
     81 <li>Supports setting a different field separator and record separator with the -F
     82 and -R options.</li>
     83 </ul>
     84 <h2>Cons</h2>
     85 <ul>
     86 <li>The tool has additional overhead because the data is processed and filtered
     87 from stdin after parsing.</li>
     88 <li>The parser does not do complete validation on numbers.</li>
     89 <li>The parser accepts some bad input such as invalid UTF-8
     90 (see <a href="https://tools.ietf.org/html/rfc8259#section-8.1">RFC8259 - 8.1. Character Encoding</a>).
     91 json2tsv reads from stdin and does not make assumptions about a "closed
     92 ecosystem" as described in the RFC.</li>
     93 <li>The parser accepts some bad JSON input and "extensions"
     94 (see <a href="https://tools.ietf.org/html/rfc8259#section-9">RFC8259 - 9. Parsers</a>).</li>
     95 <li>Encoded NUL bytes (\u0000) in strings are ignored.
     96 (see <a href="https://tools.ietf.org/html/rfc8259#section-9">RFC8259 - 9. Parsers</a>).
     97 "An implementation may set limits on the length and character contents of
     98 strings."</li>
     99 <li>The parser is not the fastest possible JSON parser (but also not the
    100 slowest).  For example: for ease of use, at the cost of performance all
    101 strings are decoded, even though they may be unused.</li>
    102 </ul>
    103 <h2>Why Yet Another JSON parser?</h2>
    104 <p>I wanted a tool that makes parsing JSON easier and works well from the shell,
    105 similar to <a href="https://stedolan.github.io/jq/">jq</a>.</p>
    106 <p>sed and grep often work well enough for matching some value using some regex
    107 pattern, but they are not good enough to parse JSON correctly or to extract all
    108 information: just like parsing HTML/XML using some regex is not good (enough)
    109 or a good idea :P.</p>
    110 <p>I didn't want to learn a new specific <a href="https://stedolan.github.io/jq/manual/#Builtinoperatorsandfunctions">meta-language</a> which jq has and wanted
    111 something simpler.</p>
    112 <p>While it is more efficient to embed this query language for data aggregation,
    113 it is also less simple. In my opinion it is simpler to separate this and use
    114 pattern-processing by awk or another filtering/aggregating program.</p>
    115 <p>For the parser, there are many JSON parsers out there, like the efficient
    116 <a href="https://github.com/zserge/jsmn">jsmn parser</a>. However, a few of its behaviours differ from what I want:</p>
    117 <ul>
    118 <li>jsmn buffers data as tokens, which is very efficient, but also a bit
    119 annoying as an API as it requires another layer of code to interpret the
    120 tokens.</li>
    121 <li>jsmn does not handle decoding strings by default, which is very efficient
    122 if you don't need parts of the data.</li>
    123 <li>jsmn does not keep context of nested structures by default, so may require
    124 writing custom utility functions for nested data.</li>
    125 </ul>
    126 <p>This is why I went for a parser design that uses a single callback per "node"
    127 type and keeps track of the current nested structure in a single array and
    128 emits that.</p>
    129 <h2>Clone</h2>
    130 <pre><code>git clone git://git.codemadness.org/json2tsv
    131 </code></pre>
    132 <h2>Browse</h2>
    133 <p>You can browse the source-code at:</p>
    134 <ul>
    135 <li><a href="https://git.codemadness.org/json2tsv/">https://git.codemadness.org/json2tsv/</a></li>
    136 <li><a href="gopher://codemadness.org/1/git/json2tsv">gopher://codemadness.org/1/git/json2tsv</a></li>
    137 </ul>
    138 <h2>Download releases</h2>
    139 <p>Releases are available at:</p>
    140 <ul>
    141 <li><a href="https://codemadness.org/releases/json2tsv/">https://codemadness.org/releases/json2tsv/</a></li>
    142 <li><a href="gopher://codemadness.org/1/releases/json2tsv">gopher://codemadness.org/1/releases/json2tsv</a></li>
    143 </ul>
    144 <h2>Build and install</h2>
    145 <pre><code>$ make
    146 # make install
    147 </code></pre>
    148 <h2>Examples</h2>
    149 <p>A usage example to parse posts of the JSON API of <a href="https://www.reddit.com/">reddit.com</a> and format them
    150 to a plain-text list using awk:</p>
    151 <pre><code>#!/bin/sh
    152 curl -s -H 'User-Agent:' 'https://old.reddit.com/.json?raw_json=1&amp;limit=100' | \
    153 json2tsv | \
    154 awk -F '\t' '
    155 function show() {
    156 	if (length(o["title"]) == 0)
    157 		return;
    158 	print n ". " o["title"] " by " o["author"] " in r/" o["subreddit"];
    159 	print o["url"];
    160 	print "";
    161 }
    162 $1 == ".data.children[].data" {
    163 	show();
    164 	n++;
    165 	delete o;
    166 }
    167 $1 ~ /^\.data\.children\[\]\.data\.[a-zA-Z0-9_]*$/ {
    168 	o[substr($1, 23)] = $3;
    169 }
    170 END {
    171 	show();
    172 }'
    173 </code></pre>
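<p>The offset 23 in substr($1, 23) is the length of the prefix ".data.children[].data." (22 characters) plus one. A variation that derives the offset from the prefix instead of hard-coding it might look as follows. The function name strip_prefix and the sample records are assumptions for illustration only.</p>

```shell
#!/bin/sh
# Hypothetical helper: print key=value for records directly below a prefix.
# The prefix is passed as an awk variable, so no offset is hard-coded.
strip_prefix() {
	awk -F '\t' -v prefix="$1" '
	index($1, prefix) == 1 {
		# Everything after the prefix is the key name.
		key = substr($1, length(prefix) + 1);
		# Keep direct children only, not deeper nested nodes.
		if (key ~ /^[a-zA-Z0-9_]+$/)
			print key "=" $3;
	}'
}

# Demo on hand-written records in the nodename, type, value layout.
printf '%s\t%s\t%s\n' \
	'.data.children[].data' o '' \
	'.data.children[].data.title' s 'Hello' \
	'.data.children[].data.author' s 'alice' |
strip_prefix '.data.children[].data.'
```

<p>Unlike the script above, this sketch skips records with an empty key, and it compares the prefix as a plain string, so the brackets and dots need no regex escaping.</p>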
    174 <h2>References</h2>
    175 <ul>
    176 <li>Sites:
    177 <ul>
    178 <li><a href="http://seriot.ch/parsing_json.php">seriot.ch - Parsing JSON is a Minefield</a></li>
    179 <li><a href="https://github.com/nst/JSONTestSuite">A comprehensive test suite for RFC 8259 compliant JSON parsers</a></li>
    180 <li><a href="https://json.org/">json.org</a></li>
    181 </ul>
    182 </li>
    183 <li>Current standard:
    184 <ul>
    185 <li><a href="https://tools.ietf.org/html/rfc8259">RFC8259 - The JavaScript Object Notation (JSON) Data Interchange Format</a></li>
    186 <li><a href="https://www.ecma-international.org/publications/standards/Ecma-404.htm">Standard ECMA-404 - The JSON Data Interchange Syntax (2nd edition, December 2017)</a></li>
    187 </ul>
    188 </li>
    189 <li>Historic standards:
    190 <ul>
    191 <li><a href="https://tools.ietf.org/html/rfc7159">RFC7159 - The JavaScript Object Notation (JSON) Data Interchange Format (obsolete)</a></li>
    192 <li><a href="https://tools.ietf.org/html/rfc7158">RFC7158 - The JavaScript Object Notation (JSON) Data Interchange Format (obsolete)</a></li>
    193 <li><a href="https://tools.ietf.org/html/rfc4627">RFC4627 - The JavaScript Object Notation (JSON) Data Interchange Format (obsolete, original)</a></li>
    194 </ul>
    195 </li>
    196 </ul>
    197 
    198 			</article>
    199 		</div>
    200 	</main>
    201 </body>
    202 </html>