Count Blank Cells

Count blank cells on text-based tabular data.

Short examples:

$ ./cntblank --output-meta testdata/prefecture_jp.tsv
INFO[0000] start parsing with 2 columns.                 path=testdata/prefecture_jp.tsv
INFO[0000] finish parsing 48 lines to get 47 records with 2 columns. 0 errors detected.  path=testdata/prefecture_jp.tsv
# File          testdata/prefecture_jp.tsv      prefecture_jp.tsv    4884d04103df0fd8a9e792866ca0b870
# Field         2                               (has header)
# Record        47
seq     Name    #Blank  %Blank  MinLength       MaxLength       #Int    #Float  #Bool   #Time    Minimum Maximum #True   #False
1       都道府県コード  0       0.0000  2       2       47       47               1       47
2       都道府県        0       0.0000  3       4

Since file argument is optional, it accepts standard input.

$ curl -sL https://raw.githubusercontent.com/datasets/language-codes/master/data/ietf-language-tags.csv |
      ./cntblank --input-delimiter=, 2>/dev/null | cut -f1-6 | expand -t 15
seq            Name           #Blank         %Blank         MinLength      MaxLength
1              lang           0              0.0000         2              14
2              langType       0              0.0000         2              4
3              territory      213            0.3096         2              3
4              revGenDate     0              0.0000         10             10
5              defs           0              0.0000         1              1
6              file           0              0.0000         6              18

Also accepts Excel file whose extension is “.xlsx”.

$ ./cntblank --output-delimiter=, testdata/addrcode_jp.xlsx
INFO[0000] start parsing with 7 columns.
INFO[0000] finish parsing 1789 lines with 0 errors       path=testdata/addrcode_jp.xlsx
INFO[0000] get 1788 records with 7 columns
seq,Name,#Blank,%Blank,MinLength,MaxLength,#Int,#Float,#Bool,#Time,Minimum,Maximum,#True,#False
1,団体コード,0,0.0000,6,6,1788,1788,,,10006,473821,,
2,都道府県名(漢字),0,0.0000,3,4,,,,,,,,
3,市区町村名(漢字),47,0.0263,2,7,,,,,,,,
4,都道府県名(カナ),0,0.0000,4,7,,,,,,,,
5,市区町村名(カナ),47,0.0263,2,13,,,,,,,,
6,Column006,1788,1.0000,,,,,,,,,,
7,Column007,1788,1.0000,,,,,,,,,,

Output

It produces reports for each file in several formats. The report contains list of field summary in addition to file name and its checksum. Field summary is consists from field name and count of blanks, and data type indicator.

Although default report format is CSV, you can specify --output-format option explicitly. Given --output option and its file extension is either of “.html”, “.json”, or “.xlsx”, it automatically changes output format along with extension.

Output fields

Name Description
seq Sequential number which starts with one.
Name Field name from first header line, otherwise “ColumnNNN” where NNN is sequential number.
#Blank Count of blank cells.
%Blank Percentage of blank cells.
MinLength Minimum length of valid cells.
MaxLength Maximum length of valid cells.
#Int Count of integer type cells. This may be blank.
#Float Count of float type cells. This may be blank.
#Bool Count of bool type cells. This may be blank.
#Time Count of time type cells. This may be blank.
Minimum Minimum value after guessing data type.
Maximum Maximum value after guessing data type.
#True Count of cells which should be treated as boolean true.
#False Count of cells which should be treated as boolean false.

Note that “1” is interpreted as boolean true and “0” is also interpreted as boolean false. Therefore, if a column is integer field, “#True” represents the count of “1” and “#False” represents the count of “0” in the field. Since some buggy data files sometimes include “0” as null accidentally, this feature may help you to count up pseudo blank cells.

HTML output

$ cntblank --output=t.html testdata/prefecture_jp.tsv

HTML sample

JSON output

$ cntblank --output-format=json testdata/prefecture_jp.tsv | jq
[
  {
    "path": "testdata/prefecture_jp.tsv",
    "filename": "prefecture_jp.tsv",
    "md5": "4884d04103df0fd8a9e792866ca0b870",
    "header": true,
    "records": 47,
    "fields": [
      {
        "name": "都道府県コード",
        "blank": 0,
        "minLength": 2,
        "maxLength": 2,
        "minimum": 1,
        "maximum": 47,
        "typeInt": 47,
        "typeFloat": 47
      },
      {
        "name": "都道府県",
        "blank": 0,
        "minLength": 3,
        "maxLength": 4
      }
    ]
  }
]

Full Usage

--help shows the details.

usage: cntblank [<flags>] [<tabfile>...]

Count blank cells on text-based tabular data.

Flags:
      --help                   Show context-sensitive help (also try --help-long and --help-man).
  -v, --verbose                Set verbose mode on.
  -e, --input-encoding=INPUT-ENCODING
                               Input encoding.
  -E, --output-encoding=OUTPUT-ENCODING
                               Output encoding.
      --input-delimiter=INPUT-DELIMITER
                               Input field delimiter.
      --output-delimiter=OUTPUT-DELIMITER
                               Output field delmiter.
      --without-header         Tabular does not have header line.
      --output-without-header  Output report does not have header line.
      --strict                 Check column size strictly.
      --sheet=SHEET            Excel sheet number which starts with 1.
  -r, --recursive              Traverse directory recursively.
      --output-meta            Put meta information.
  -o, --output=OUTPUT          Output file.
      --output-format=OUTPUT-FORMAT
                               Output format.
      --version                Show application version.

Args:
  [<tabfile>]  Tabular data files.