lingvo.tools.count_records module

Tool to count number of records in a dataset.

Most file formats have an efficient way to fetch the number of records in a dataset. However, some formats, such as TFRecord, require you to scan the files in full to perform this count.

This is a short Apache Beam script that can use many machines to read all of the files in parallel, which is potentially faster than a single-machine script. For other file formats, reading the metadata available in the format should be enough; this tool should not be extended to any format that already has an efficient way of counting records.
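
To illustrate why TFRecord counting amounts to a scan, here is a minimal sketch that is not this module's implementation. It assumes uncompressed TFRecord files, where each record is framed as an 8-byte little-endian length, a 4-byte CRC of the length, the payload, and a 4-byte CRC of the payload. The names `count_tfrecords` and `count_all` are hypothetical, and a local thread pool stands in for the distributed Beam workers.

```python
import os
import struct
from concurrent.futures import ThreadPoolExecutor


def count_tfrecords(path):
    """Counts records in one uncompressed TFRecord file (hypothetical helper).

    The file has no record-count header, so the only option is to hop from
    one record frame to the next until the end of the file. CRCs are skipped,
    not verified.
    """
    count = 0
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        while True:
            header = f.read(8)
            if not header:
                break  # Clean end of file.
            if len(header) < 8:
                raise ValueError(f"Truncated length field in {path}")
            (length,) = struct.unpack("<Q", header)
            # Skip the length CRC (4 bytes), the payload, and the payload CRC (4 bytes).
            f.seek(4 + length + 4, os.SEEK_CUR)
            if f.tell() > size:
                raise ValueError(f"Truncated record in {path}")
            count += 1
    return count


def count_all(paths, max_workers=8):
    """Counts records across files in parallel (single-machine stand-in for Beam)."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return sum(pool.map(count_tfrecords, paths))
```

Calling `count_all` with a list of shard paths returns the total number of records. Each file is counted independently and the per-file counts are summed, which is the property that lets the work be spread across many machines.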

lingvo.tools.count_records.main(argv)