Fsdb - a flat-text database for shell scripting
Fsdb, the flatfile streaming database, is a package of commands for manipulating flat-ASCII databases from shell scripts. Fsdb is useful for processing medium amounts of data (with very little data you'd do it by hand; with megabytes you might want a real database). Fsdb was known as Jdb from 1991 to Oct. 2008.
Fsdb is very good at doing things like:
extracting measurements from experimental output
examining data to address different hypotheses
joining data from different experiments
eliminating/detecting outliers
computing statistics on data (mean, confidence intervals, correlations, histograms)
reformatting data for graphing programs
Fsdb is built around the idea of a flat text file as a database. Fsdb files (by convention, with the extension .fsdb), have a header documenting the schema (what the columns mean), and then each line represents a database record (or row).
For example:
#fsdb experiment duration
ufs_mab_sys 37.2
ufs_mab_sys 37.3
ufs_rcp_real 264.5
ufs_rcp_real 277.9
is a simple file with four experiments (the rows), each with a description and a run time in the first and second columns.
Rather than hand-coding scripts for each special case, Fsdb provides higher-level functions. Although it's often easy to throw together a custom script to do any single task, I believe that there are several advantages to using Fsdb:
These programs provide a higher-level interface than plain Perl, so you get fewer lines of simpler code:
dbrow '_experiment eq "ufs_mab_sys"' | dbcolstats duration
Picks out just one type of experiment and computes statistics on it, rather than:
while (<>) { split; $sum+=$F[1]; $ss+=$F[1]**2; $n++; }
$mean = $sum / $n; $std_dev = ...
in dozens of places.
The library uses names for columns, so there's no more $F[1]; use _duration instead.
New columns, or columns in a different order? No changes to your scripts!
Thus if your experiment gets more complicated with a size parameter, so your log changes to:
#fsdb experiment size duration
ufs_mab_sys 1024 37.2
ufs_mab_sys 1024 37.3
ufs_rcp_real 1024 264.5
ufs_rcp_real 1024 277.9
ufs_mab_sys 2048 45.3
ufs_mab_sys 2048 44.2
Then the previous scripts still work, even though duration is now the third column, not the second.
A series of actions is self-documenting (the provenance of processing done to produce each output is recorded in comments).
No more wondering what hacks were used to compute the final data, just look at the comments at the end of the output.
For example, the commands
dbrow '_experiment eq "ufs_mab_sys"' | dbcolstats duration
add to the end of the output the lines:
# | dbrow _experiment eq "ufs_mab_sys"
# | dbcolstats duration
The library is mature: it supports large datasets (more than 100GB), parallelism, corner cases, and error handling, and it is backed by an automated test suite.
No more puzzling about bad output because your custom script skimped on error checking.
No more memory thrashing when you try to sort ten million records.
Makes use of multiple cores in your computer when it can, because each pipeline component runs in parallel, and because key tools (dbsort, dbmapreduce) run in parallel when possible.
Fsdb-2.x supports Perl scripting (in addition to shell scripting), with libraries to do Fsdb input and output, and easy support for pipelines. The shell script
dbcol name test1 | dbroweval '_test1 += 5;'
can be written in perl as:
dbpipeline(dbcol(qw(name test1)), dbroweval('_test1 += 5;'));
(The disadvantage is that you need to learn what functions Fsdb provides.)
Fsdb is built on flat-ASCII databases. By storing data in simple text files and processing it with pipelines it is easy to experiment (in the shell) and look at the output. To the best of my knowledge, the original implementation of this idea was /rdb
, a commercial product described in the book UNIX relational database management: application development in the UNIX environment by Rod Manis, Evan Schaffer, and Robert Jorgensen (1988 by Prentice Hall, and also at the web page http://www.rdb.com/). Fsdb is an incompatible re-implementation of their idea without any accelerated indexing or forms support. (But it's free, and probably has better statistics!).
Fsdb-2.x will exploit multiple processors or cores, and provides Perl-level support for input, output, and threaded-pipelines. (As of Fsdb-2.44 it no longer uses Perl threading, just processes, since they are faster.)
Installation instructions follow at the end of this document. Fsdb-2.x requires Perl 5.8 to run. All commands have manual pages and provide usage with the --help
option. All commands are backed by an automated test suite.
The most recent version of Fsdb is available on the web at http://www.isi.edu/~johnh/SOFTWARE/FSDB/index.html.
dbrowdiff now has a -e option to specify the value to use in future-mode for the last row.
dbjoin now propagates types, rather than eating them.
Fsdb now uses the standard Perl build and installation from ExtUtils::MakeMaker(3), so the quick answer to installation is to type:
perl Makefile.PL
make
make test
sudo make install
Or, if you want to install it somewhere else, change the first line to
perl Makefile.PL PREFIX=$HOME
then run the other commands (make; make test; make install, but now without the sudo), and it will go in your home directory's bin, etc. (See ExtUtils::MakeMaker(3) for more details.)
Fsdb requires perl 5.8 or later.
A test-suite is available, run it with
make test
In the past, ports existed for FreeBSD and MacOS. If someone running one of those OSes wants to contribute a new port, please let me know.
These programs are based on the idea of storing data in simple ASCII files. A database is a file with one header line and then data or comment lines. For example:
#fsdb account passwd uid gid fullname homedir shell
johnh * 2274 134 John_Heidemann /home/johnh /bin/bash
greg * 2275 134 Greg_Johnson /home/greg /bin/bash
root * 0 0 Root /root /bin/bash
# this is a simple database
The header line must be first and begins with #fsdb
. There are rows (records) and columns (fields), just like in a normal database. Comment lines begin with #
. Column names are any string not containing spaces or single quotes (although it is prudent to keep them alphanumeric with underscores).
Columns can optionally include type annotations by following the name with :t, where t is some type. (Types are not used in Perl, but are relevant in the Python and Go Fsdb bindings.) Types use a subset of Perl pack specifiers: c, s, l, q are signed 8-, 16-, 32-, and 64-bit integers; f is a float; d is a double float; a is a UTF-8 string; and > and < can force big or little endianness.
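For example, a type-annotated version of the earlier experiment header might look like this (a hedged sketch; the size column comes from the expanded example above, and the specific types chosen here are only illustrative):
#fsdb experiment:a size:l duration:d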
By default, columns are delimited by any amount of whitespace. With this default configuration, the contents of a field cannot contain whitespace. However, this limitation can be relaxed by changing the field separator as described below.
The big advantage of simple flat-text databases is that it is usually easy to massage data into this format, and it's reasonably easy to take data out of this format into other (text-based) programs, like gnuplot, jgraph, and LaTeX. Think Unix. Think pipes. (Or even output to Excel and HTML if you prefer.)
Since no-whitespace in columns was a problem for some applications, there's an option which relaxes this rule. You can specify the field separator in the table header with -F x, where x is a code for the new field separator. A full list of codes is at dbfilealter(1), but two common special values are -F t, a separator of a single tab character, and -F S, a separator of two spaces. Both allow (single) spaces in fields. An example:
#fsdb -F S account passwd uid gid fullname homedir shell
johnh * 2274 134 John Heidemann /home/johnh /bin/bash
greg * 2275 134 Greg Johnson /home/greg /bin/bash
root * 0 0 Root /root /bin/bash
# this is a simple database
See dbfilealter(1) for more details. Regardless of what the column separator is for the body of the data, it's always whitespace in the header.
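For example, to convert an existing file from whitespace-separated columns to double-space separation, one might run dbfilealter with the -F option (a sketch using the passwd.fsdb example above):
dbfilealter -F S < DATA/passwd.fsdb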
There's also a third format: a "list". Because it's often hard to see which value belongs to which column past the first two, in list format each "column" is on a separate line. The programs dblistize and dbcolize convert to and from this format, and all programs work with either format. The command
dbfilealter -R C < DATA/passwd.fsdb
outputs:
#fsdb -R C account passwd uid gid fullname homedir shell
account: johnh
passwd: *
uid: 2274
gid: 134
fullname: John_Heidemann
homedir: /home/johnh
shell: /bin/bash
account: greg
passwd: *
uid: 2275
gid: 134
fullname: Greg_Johnson
homedir: /home/greg
shell: /bin/bash
account: root
passwd: *
uid: 0
gid: 0
fullname: Root
homedir: /root
shell: /bin/bash
# this is a simple database
# | dblistize
See dbfilealter(1) for more details.
A number of programs exist to manipulate databases. Complex functions can be made by stringing together commands with shell pipelines. For example, to print the home directories of everyone with ``john'' in their names, you would do:
cat DATA/passwd | dbrow '_fullname =~ /John/' | dbcol homedir
The output might be:
#fsdb homedir
/home/johnh
/home/greg
# this is a simple database
# | dbrow _fullname =~ /John/
# | dbcol homedir
(Notice that comments are appended to the output listing each command, providing an automatic audit log.)
In addition to typical database functions (select, join, etc.) there are also a number of statistical functions.
The real power of Fsdb is that one can apply arbitrary code to rows to do powerful things.
cat DATA/passwd | dbroweval '_fullname =~ s/(\w+)_(\w+)/$2,_$1/'
converts "John_Heidemann" into "Heidemann,_John". Not too much more work could split fullname into firstname and lastname fields.
(Or:
cat DATA/passwd | dbcolcreate sort | dbroweval -b 'use Fsdb::Support'
'_sort = _fullname; _sort =~ s/_/ /g; _sort = fullname_to_sort(_sort);' )
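As a sketch of that fuller idea (the firstname and lastname columns are invented here, not part of the distributed examples), one might write:
cat DATA/passwd | dbcolcreate firstname lastname | \
    dbroweval '(_firstname, _lastname) = split(/_/, _fullname, 2);'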
An advantage of Fsdb is that you can talk about columns by name (symbolically) rather than simply by their positions. So in the above example, dbcol homedir
pulled out the home directory column, and dbrow '_fullname =~ /John/'
matched against column fullname.
In general, you can use the name of the column listed on the #fsdb
line to identify it in most programs, and _name to identify it in code.
Some alternatives for flexibility:
Numeric values identify columns positionally, numbering from 0. So 0 or _0 is the first column, 1 is the second, etc.
In code, _last_columnname gets the value of columnname from the previous row (see the sketch after this list).
See dbroweval(1) for more details about writing code.
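For example, here is a hedged sketch of row-to-row differencing with _last_: assuming a hypothetical trace.fsdb with a numeric duration column, this computes the change in duration from one row to the next (the value on the first row is not meaningful because _last_duration starts out empty):
cat trace.fsdb | dbcolcreate delta | \
    dbroweval '_delta = _duration - _last_duration;'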
Enough said. I'll summarize the commands, and then you can experiment. For a detailed description of each command, see a summary by running it with the argument --help
(or -?
if you prefer.) Full manual pages can be found by running the command with the argument --man
, or running the Unix command man dbcol
or whatever program you want.
add columns to a database
set the column headings for a non-Fsdb file
select columns from a table
select rows from a table
sort rows based on a set of columns
compute the natural join of two tables
rename a column
merge two columns into one
split one column into two or more columns
split one column into multiple rows
"pivots" a file, converting multiple rows corresponding to the same entity into a single row with multiple columns.
check that db file doesn't have some common errors
compute statistics over a column (mean, etc., optionally median)
group rows by some key value, then compute stats (mean, etc.) over each group (equivalent to dbmapreduce with dbcolstats as the reducer)
group rows (map) and then apply an arbitrary function to each group (reduce)
compare two samples distributions (mean/conf interval/T-test)
compute moving statistics over a column of data
compute Z-scores and T-scores over one column of data
compute the rank or percentile of a column
compute histograms over a column of data
compute the coefficient of correlation over several columns
drop rows selectively, keeping large changes and periodic samples
compute linear regression and correlation for two columns
compute a running sum over a column of data
count the number of rows (a subset of dbstats)
compute differences of a column between rows of a table
number each row
run arbitrary Perl code on each row
count/eliminate identical rows (like Unix uniq(1))
compare fields on rows of a file (something like Unix diff(1))
pretty-print columns
convert between column or list format, or change the column separator
remove comments from a table
generate a script that sends form mail based on each row
(These programs convert data into fsdb. See their web pages for details.)
HTML tables to fsdb (assuming they're reasonably formatted).
http://ficus-www.cs.ucla.edu/ficus-members/geoff/kitrace.html
the output of SQL SELECT tables to db
spreadsheet tab-delimited files to db
(see man tcpdump(8) on any reasonable system)
XML input to fsdb, assuming they're very regular
(And out of fsdb:)
Comma-separated-value format from fsdb.
simple conversion of Fsdb to html tables
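To give a flavor of how the commands above chain together, here is a sketch over the DATA/grades file shown later in this document (all options used here appear elsewhere in this manual):
cat DATA/grades | \
    dbrow '_test1 >= 70' | \
    dbcolhisto -n 5 -g test1 | \
    dbcolneaten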
Many programs have common options:
Show basic usage.
When a command creates a new column like dbrowaccumulate's accum
, this option lets one override the default name of that new column.
Where to put temporary files. Also uses the environment variable TMPDIR, if -T is not specified. Default is /tmp.
Show basic usage.
Specify confidence interval FRACTION (dbcolstats, dbmultistats, etc.)
--element-separator S
Specify column separator S (dbcolsplittocols, dbcolmerge).
Enable debugging (may be repeated for greater effect in some cases).
Compute stats over all data (treating non-numbers as zeros). (By default, things that can't be treated as numbers are ignored for stats purposes.)
Assume the data is pre-sorted. May be repeated to disable verification (saving a small amount of work).
Give value E as the value for empty (null) records.
Input data from file I.
Write data out to file O.
Use H as the full Fsdb header, rather than reading a header from the input. This option is particularly useful when using Fsdb under Hadoop, where split files don't have headers.
Skip logging the program in a trailing comment.
When giving Perl code (in dbrow and dbroweval), column names can be embedded if preceded by underscores. Look at dbrow(1) or dbroweval(1) for examples.
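For example, the standard input and output options combine with ordinary arguments like this (a sketch; homedirs.fsdb is a made-up output name):
dbcol -i DATA/passwd.fsdb -o homedirs.fsdb fullname homedir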
Most programs run in constant memory and use temporary files if necessary. Exceptions are dbcolneaten, dbcolpercentile, dbmapreduce, dbmultistats, dbrowsplituniq.
A number of programs do sorting, or depend on defining an ordering of rows. Such programs use these standard sorting options:
sort in reverse order (high to low)
sort in normal order (low to high)
sort fields by type (numeric or lexicographic), automatically
sort numerically
sort lexicographically
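For example, to sort the grades file (shown later) by score, numerically and high to low (a minimal sketch):
cat DATA/grades | dbsort -nr test1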
Take the raw data in DATA/http_bandwidth, put a header on it (dbcoldefine size bw), take statistics of each category (dbmultistats -k size bw), pick out the relevant fields (dbcol size mean stddev pct_rsd), and you get:
#fsdb size mean stddev pct_rsd
1024 1.4962e+06 2.8497e+05 19.047
10240 5.0286e+06 6.0103e+05 11.952
102400 4.9216e+06 3.0939e+05 6.2863
# | dbcoldefine size bw
# | /home/johnh/BIN/DB/dbmultistats -k size bw
# | /home/johnh/BIN/DB/dbcol size mean stddev pct_rsd
(The whole command was:
cat DATA/http_bandwidth |
dbcoldefine size bw |
dbmultistats -k size bw |
dbcol size mean stddev pct_rsd
all on one line.)
Then post-process them to get rid of the exponential notation by adding this to the end of the pipeline:
dbroweval '_mean = sprintf("%8.0f", _mean); _stddev = sprintf("%8.0f", _stddev);'
(Actually, this step is no longer required since dbcolstats now uses a different default format.)
giving:
#fsdb size mean stddev pct_rsd
1024 1496200 284970 19.047
10240 5028600 601030 11.952
102400 4921600 309390 6.2863
# | dbcoldefine size bw
# | dbmultistats -k size bw
# | dbcol size mean stddev pct_rsd
# | dbroweval { _mean = sprintf("%8.0f", _mean); _stddev = sprintf("%8.0f", _stddev); }
In a few lines, raw data is transformed to processed output.
Suppose you expect there is an odd distribution of results of one datapoint. Fsdb can easily produce a CDF (cumulative distribution function) of the data, suitable for graphing:
cat DB/DATA/http_bandwidth | \
dbcoldefine size bw | \
dbrow '_size == 102400' | \
dbcol bw | \
dbsort -n bw | \
dbrowenumerate | \
dbcolpercentile count | \
dbcol bw percentile | \
xgraph
The steps, roughly: (1) get the raw input data and turn it into fsdb format, (2) pick out just the relevant column (for efficiency) and sort it, (3) for each data point, assign a CDF percentage to it, (4) pick out the two columns to graph and show them.
The first commercial program I wrote was a gradebook, so here's how to do it with Fsdb.
Format your data like DATA/grades.
#fsdb name email id test1
a a@ucla.example.edu 1 80
b b@usc.example.edu 2 70
c c@isi.example.edu 3 65
d d@lmu.example.edu 4 90
e e@caltech.example.edu 5 70
f f@oxy.example.edu 6 90
Or if your students have spaces in their names, use -F S
and two spaces to separate each column:
#fsdb -F S name email id test1
alfred aho a@ucla.example.edu 1 80
butler lampson b@usc.example.edu 2 70
david clark c@isi.example.edu 3 65
constantine drovolis d@lmu.example.edu 4 90
debrorah estrin e@caltech.example.edu 5 70
sally floyd f@oxy.example.edu 6 90
To compute statistics on an exam, do
cat DATA/grades | dbstats test1 | dblistize
giving
#fsdb -R C ...
mean: 77.5
stddev: 10.84
pct_rsd: 13.987
conf_range: 11.377
conf_low: 66.123
conf_high: 88.877
conf_pct: 0.95
sum: 465
sum_squared: 36625
min: 65
max: 90
n: 6
...
To do a histogram:
cat DATA/grades | dbcolhisto -n 5 -g test1
giving
#fsdb low histogram
65 *
70 **
75
80 *
85
90 **
# | /home/johnh/BIN/DB/dbhistogram -n 5 -g test1
Now you want to send out grades to the students by e-mail. Create a form-letter (in the file test1.txt):
To: _email (_name)
From: J. Random Professor <jrp@usc.example.edu>
Subject: test1 scores
_name, your score on test1 was _test1.
86+ A
75-85 B
70-74 C
0-69 F
Generate the shell script that will send the mail out:
cat DATA/grades | dbformmail test1.txt > test1.sh
And run it:
sh <test1.sh
The last two steps can be combined:
cat DATA/grades | dbformmail test1.txt | sh
but I like to keep a copy of exactly what I send.
At the end of the semester you'll want to compute grade totals and assign letter grades. Both fall out of dbroweval. For example, to compute weighted total grades with a 40% midterm/60% final where the midterm is 84 possible points and the final 100:
dbcol -rv total |
dbcolcreate total - |
dbroweval '
_total = .40 * _midterm/84.0 + .60 * _final/100.0;
_total = sprintf("%4.2f", _total);
if (_final eq "-" || ( _name =~ /^_/)) { _total = "-"; };' |
dbcolneaten
If you got the data originally from a spreadsheet, save it in "tab-delimited" format and convert it with tabdelim_to_db (run tabdelim_to_db -? for examples).
To convert the Unix password file to db:
cat /etc/passwd | sed 's/:/ /g'| \
dbcoldefine -F S login password uid gid gecos home shell \
>passwd.fsdb
To convert the group file
cat /etc/group | sed 's/:/ /g' | \
dbcoldefine -F S group password gid members \
>group.fsdb
To show the names of the groups that div7-members are in (assuming DIV7 is in the gecos field):
cat passwd.fsdb | dbrow '_gecos =~ /DIV7/' | dbcol login gid | \
dbjoin -i - -i group.fsdb gid | dbcol login group
Which Fsdb programs are the most complicated (based on number of test cases)?
ls TEST/*.cmd | \
dbcoldefine test | \
dbroweval '_test =~ s@^TEST/([^_]+).*$@$1@' | \
dbrowuniq -c | \
dbsort -nr count | \
dbcolneaten
(Answer: dbmapreduce, then dbcolstats, dbfilealter and dbjoin.)
Stats on an exam (in $FILE
, where $COLUMN
is the name of the exam)?
cat $FILE | dbcolstats -q 4 $COLUMN | dblistize | dbstripcomments
cat $FILE | dbcolhisto -g -n 20 $COLUMN | dbcolneaten | dbstripcomments
Merging the hw1 column from file hw1.fsdb into grades.fsdb, assuming there's a common student id in column "id":
dbcol id hw1 <hw1.fsdb >t.fsdb
dbjoin -a -e - grades.fsdb t.fsdb id | \
dbsort name | \
dbcolneaten >new_grades.fsdb
Merging two fsdb files with the same rows:
cat file1.fsdb file2.fsdb >output.fsdb
or if you want to clean things up a bit
cat file1.fsdb file2.fsdb | dbstripextraheaders >output.fsdb
or if you want to know where the data came from
for i in 1 2
do
dbcolcreate source $i < file$i.fsdb
done >output.fsdb
(assumes you're using a Bourne-shell compatible shell, not csh).
As with any tool, one should (which means must) understand the limits of the tool.
All Fsdb tools should run in constant memory. In some cases (such as dbcolstats with quartiles, where the whole input must be re-read), programs will spool data to disk if necessary.
Most tools buffer one or a few lines of data, so memory will scale with the size of each line. (So lines with many columns, or whose columns hold lots of data, may cause large memory consumption.)
All Fsdb tools should run in constant or at worst n log n time.
All Fsdb tools use normal Perl math routines for computation. Although I make every attempt to choose numerically stable algorithms (and I welcome feedback and suggestions for improvement), normal rounding due to computer floating point approximations can result in inaccuracies when data spans a large range of precision. (See for example the dbcolstats_extrema test cases.)
Any requirements and limitations of each Fsdb tool are documented on its manual page.
If any Fsdb program violates these assumptions, that is a bug that should be documented on the tool's manual page or ideally fixed.
Fsdb does depend on Perl's correctness, and Perl (and Fsdb) have some bugs. Fsdb should work on perl from version 5.10 onward.
There have been four major versions of Fsdb: fsdb-0.x was begun in 1991 for my personal use. Fsdb 1.0 is a complete re-write of the pre-1995 versions, and was distributed from 1995 to 2007. Fsdb 2.0 is a significant re-write of the 1.x versions to systematically use a library and threads (although threads were replaced with full processes in 2.44). Fsdb 3.0 in 2022 adds type specifiers to the schema, mostly to support use in languages with stronger typing (like Python, Go, and C).
Fsdb (in its various forms) has been used extensively by its author since 1991. Since 1995 it's been used by two other researchers at UCLA and several at ISI. In February 1998 it was announced to the Internet. Since then it has found a few users, some outside where I work.
Major changes:
I've thought about fsdb-2.0 for many years, but it was started in earnest in 2007. Fsdb-2.0 has the following goals:
While fsdb is great on the Unix command line as a pipeline between programs, it should also be possible to set it up to run in a single process. And if it does so, it should be able to avoid serializing and deserializing (converting to and from text) data between each module. (Accomplished in fsdb-2.0: see dbpipeline, although still needs tuning.)
Fsdb's roots go back to perl4 and 1991, so the fsdb-1.x library is very, very crufty. More than just being ugly (but it was that too), this made things like reading from one file format and writing to another the application's job, when it should be the library's. (Accomplished in fsdb-1.15 and improved in 2.0: see Fsdb::IO.)
Because fsdb modules were added as needed over 10 years, sometimes the module APIs became inconsistent. (For example, the 1.x dbcolcreate
required an empty value following the name of the new column, but other programs specify empty values with the -e
argument.) We should smooth over these inconsistencies. (Accomplished as each module was ported in 2.0 through 2.7.)
Given a clean IO API, the distinction between "colized" and "listized" fsdb files should go away. Any program should be able to read and write files in any format. (Accomplished in fsdb-2.1.)
Fsdb-2.0 preserves backwards compatibility where possible, but breaks it where necessary to accomplish the above goals. In August 2008, Fsdb-2.7 was declared preferred over the 1.x versions. Benchmarking in 2013 showed that threading performed much worse than just using pipes, because Perl's requirements for data that is shared between multiple threads is quite heavyweight. Fsdb-2.44 therefore uses threading "style", but implemented with processes (via my "Freds" library).
Fsdb's use of Unix pipelines means Fsdb automatically benefits from multiprocessor computers---each pipeline stage can run on a separate core. In addition, compute-intensive Fsdb modules like dbsort and dbmapreduce are explicitly multi-process and will use as many cores as they can, up to the number of cores on the local computer.
Although Fsdb takes advantage of as much parallelism as it can, a five-stage pipeline won't necessarily saturate five cores. Pipeline stages almost always have different amounts of work to do, and some stages are often data limited. (Dbsort attempts as much parallelism as it can, and can run 10-way parallel or more over a large enough input dataset. But it cannot sustain high parallelism because of the requirement that it produce one global output.)
There are two motivations for adding optional typing to Fsdb. First, languages such as Python and Go would really like type information. As of 2022 there are now users of those languages, so the basic system should support them.
Second, while pure text is flexible, it's very inefficient---converting numbers to and from decimal is thousands of instructions, and binary encodings are often much smaller than text. In the future, I would love to have a flag that enables a binary encoding.
Typing is optional---omitting types is never wrong.
One somewhat odd thing about typing is that we reuse the Perl pack definitions of types, so q (for "quadword") stands for 64-bit integer. These are perhaps not the most mnemonic choices in 2022, but I would rather pick someone's existing set than try to define my own.
Fsdb includes code ported from Geoff Kuenning (Fsdb::Support::TDistribution
).
Fsdb contributors: Ashvin Goel goel@cse.oge.edu, Geoff Kuenning geoff@fmg.cs.ucla.edu, Vikram Visweswariah visweswa@isi.edu, Kannan Varadahan kannan@isi.edu, Lars Eggert larse@isi.edu, Arkadi Gelfond arkadig@dyna.com, David Graff graff@ldc.upenn.edu, Haobo Yu haoboy@packetdesign.com, Pavlin Radoslavov pavlin@catarina.usc.edu, Graham Phillips, Yuri Pradkin, Alefiya Hussain, Ya Xu, Michael Schwendt, Fabio Silva fabio@isi.edu, Jerry Zhao zhaoy@isi.edu, Ning Xu nxu@aludra.usc.edu, Martin Lukac mlukac@lecs.cs.ucla.edu, Xue Cai, Michael McQuaid, Christopher Meng, Calvin Ardi, H. Merijn Brand, Lan Wei, Hang Guo, Wes Hardaker, Erica Stutz.
Fsdb includes datasets contributed from NIST (DATA/nist_zarr13.fsdb), from http://www.itl.nist.gov/div898/handbook/eda/section4/eda4281.htm, the NIST/SEMATECH e-Handbook of Statistical Methods, section 1.4.2.8.1. Background and Data. The source is public domain, and reproduced with permission.
As stated in the introduction, Fsdb is an incompatible reimplementation of the ideas found in /rdb
. By storing data in simple text files and processing it with pipelines it is easy to experiment (in the shell) and look at the output. The original implementation of this idea was /rdb, a commercial product described in the book UNIX relational database management: application development in the UNIX environment by Rod Manis, Evan Schaffer, and Robert Jorgensen (and also at the web page http://www.rdb.com/).
While Fsdb is inspired by Rdb, it includes no code from it, and Fsdb makes several different design choices. In particular: rdb attempts to be closer to a "real" database, with provision for locking and file indexing. Fsdb focuses on single-user use and so eschews these features. Rdb also has some support for interactive editing. Fsdb leaves editing to text editors like emacs or vi.
In August, 2002 I found out Carlo Strozzi extended RDB with his package NoSQL http://www.linux.it/~carlos/nosql/. According to Mr. Strozzi, he implemented NoSQL in awk to avoid Perl start-up costs in RDB. Although I haven't found Perl startup overhead to be a big problem on my platforms (from old Sparcstation IPCs to 2GHz Pentium-4s), you may want to evaluate his system. The Linux Journal has a description of NoSQL at http://www.linuxjournal.com/article/3294. It seems quite similar to Fsdb. Like /rdb, NoSQL supports indexing (not present in Fsdb). Fsdb appears to have richer support for statistics, and, as of Fsdb-2.x, its support for Perl threading may support faster performance (one-process, less serialization and deserialization).
Versions prior to 1.0 were released informally on my web page but were not announced.
started for my own research use
first check-in to RCS
parts now require perl5
adds autoconf support and a test script.
support for double space field separators, better tests
minor changes and release on comp.lang.perl.announce
adds median and quartile options to dbstats
adds dmalloc_to_db converter
fixes some warnings
dbjoin now can run on unsorted input
fixes a dbjoin bug
some more tests in the test suite
improves error messages (all should now report the program that makes the error)
fixed a bug in dbstats output when the mean is zero
fsdb-announce@heidemann.la.ca.us
and fsdb-talk@heidemann.la.ca.us
To subscribe to either, send mail to fsdb-announce-request@heidemann.la.ca.us
or fsdb-talk-request@heidemann.la.ca.us
with "subscribe" in the BODY of the message.
larse@isi.edu
nxu@aludra.usc.edu
2.0, 25-Jan-08 --- a quiet 2.0 release (gearing up towards complete)
Converted programs include dbstats (renamed dbcolstats), dbcolrename, and dbcolcreate. It also provides perl function aliases for the internal modules, so a string of fsdb commands in perl is nearly as terse as in the shell:
use Fsdb::Filter::dbpipeline qw(:all);
dbpipeline(
dbrow(qw(name test1)),
dbroweval('_test1 += 5;')
);
dbcolstats now always gives - (the default empty value) for statistics it cannot compute (for example, standard deviation if there is only one row), instead of the old mix of - and "na".
The -t mean,stddev option is now --tmean mean --tstddev stddev. See dbcolstatscores for details.
Empty values are now given with the -e option.
dbrowcount counts rows, the equivalent of the n output of dbcolstats (except without differentiating numeric/non-numeric input), or the equivalent of dbstripcomments | wc -l.
The -i option to include non-matches is now renamed -a, so as to not conflict with the new standard option -i for input file.
2.1, 6-Apr-08 --- another alpha 2.0, but now all converted programs understand both listize and colize format
The old dbjoin argument -i is now -a or --type=outer.
A minor change: comments in the source files for dbjoin are now intermixed with output rather than being delayed until the end.
The -e option (to avoid end-of-line spaces) is now -E, to avoid conflicts with the standard empty field argument.
Another -e option is now -E to avoid conflicts, and its -n, -s, and -w options are now -N, -S, and -W to correspond.
Fsdb::IO readers and writers now understand both list-format and column-format data, so all converted programs can now automatically read either format. This capability was one of the milestone goals for 2.0, so yea!
Release 2.2 is another 2.x alpha release. Now most of the commands are ported, but a few remain, and I plan one last incompatible change (to the file header) before 2.x final.
shifting more old programs to Perl modules. New in 2.2: dbrowaccumulate, dbformmail, dbcolmovingstats, dbrowuniq, dbrowdiff, dbcolmerge, dbcolsplittocols, dbcolsplittorows, dbmapreduce, dbmultistats, dbrvstatdiff. Also dbrowenumerate exists only as a front-end (command-line) program.
The following programs have been dropped from fsdb-2.x: dbcoltighten, dbfilesplit, dbstripextraheaders, dbstripleadingspace.
combined_log_format_to_db to convert Apache logfiles
Options to dbrowdiff are now -B and -I, not -a and -i.
dbstripcomments is now dbfilestripcomments.
dbcolneaten better handles empty columns; dbcolhisto warning suppressed (actually a bug in high-bucket handling).
dbmultistats now requires a -k
option in front of the key (tag) field, or if none is given, it will group by the first field (both like dbmapreduce).
dbmultistats with quantile option doesn't work currently.
dbcoldiff is renamed dbrvstatdiff.
dbformmail was leaving its log message as a command, not a comment. Oops. No longer.
Another alpha release, this one just to fix the critical dbjoin bug listed below (that happens to have blocked my MP3 jukebox :-).
Dbsort no longer hangs if given an input file with no rows.
Dbjoin now works with unsorted input coming from a pipeline (like stdin). Perl-5.8.8 has a bug (?) that was making this case fail---opening stdin in one thread, reading some, then reading more in a different thread caused an lseek which works on files, but fails on pipes like stdin. Go figure.
The dbjoin fix also fixed dbmultistats -q (it now gives the right answer). Although a new bug appeared, messages like: Attempt to free unreferenced scalar: SV 0xa9dd0c4, Perl interpreter: 0xa8350b8 during global destruction. So the dbmultistats_quartile test is still disabled.
Another alpha release, mostly to fix minor usability problems in dbmapreduce and client functions.
dbrow now defaults to running user supplied code without warnings (as with fsdb-1.x). Use --warnings
or -w
to turn them back on.
dbroweval can now write different format output than the input, using the -m
option.
dbmapreduce emits warnings on perl 5.10.0 about "Unbalanced string table refcount" and "Scalars leaked" when run with an external program as a reducer.
dbmultistats emits the warning "Attempt to free unreferenced scalar" when run with quartiles.
In each case the output is correct. I believe these can be ignored.
dbmapreduce no longer logs a line for each reducer that is invoked.
Another alpha release, fixing more minor bugs in dbmapreduce
and lossage in Fsdb::IO
.
dbmapreduce can now tolerate non-map-aware reducers that pass back the key column in put. It also passes the current key as the last argument to external reducers.
Fsdb::IO::Reader, correctly handle -header
option again. (Broken since fsdb-2.3.)
Another alpha release, needed to fix DaGronk. One new port, small bug fixes, and important fix to dbmapreduce.
shifting more old programs to Perl modules. New in 2.2: dbcolpercentile.
dbcolpercentile now takes --rank to require ranking instead of -r. Also, --ascending and --descending can now be specified separately, both for --percentile and --rank.
.Sigh, the sense of the --warnings option in dbrow was inverted. No longer.
I found and fixed the string leaks (errors like "Unbalanced string table refcount" and "Scalars leaked") in dbmapreduce and dbmultistats. (All IO::Handle
s in threads must be manually destroyed.)
The -C
option to specify the column separator in dbcolsplittorows now works again (broken since it was ported).
2.7, 30-Jul-08 beta
The beta release of fsdb-2.x. Finally, all programs are ported. As statistics, the number of lines of non-library code doubled from 7.5k to 15.5k. The libraries are much more complete, going from 866 to 5164 lines. The overall number of programs is about the same, although 19 were dropped and 11 were added. The number of test cases has grown from 116 to 175. All programs are now in perl-5, no more shell scripts or perl-4. All programs now have manual pages.
Although this is a major step forward, I still expect to rename "jdb" to "fsdb".
shifting more old programs to Perl modules. New in 2.7: dbcolscorrelate, dbcolsregression, cgi_to_db, dbfilevalidate, db_to_csv, csv_to_db, db_to_html_table, kitrace_to_db, tcpdump_to_db, tabdelim_to_db, ns_to_db.
The following programs have been dropped from fsdb-2.x: db2dcliff, dbcolmultiscale, crl_to_db, ipchain_logs_to_db. They may come back, but seemed overly specialized. The program dbrowsplituniq was dropped because it is superseded by dbmapreduce. dmalloc_to_db was dropped pending test cases and examples.
dbfilevalidate now has a -c
option to correct errors.
html_table_to_db provides the inverse of db_to_html_table.
Change header format, preserving forwards compatibility.
Complete editing pass over the manual, making sure it aligns with fsdb-2.x.
The header of fsdb files has changed: it is now #fsdb, not #h (or #L), and parsing of -F and -R is also different. See dbfilealter for the new specification. The v1 file format will be read, compatibly, but not written.
dbmapreduce now tolerates comments that precede the first key, instead of failing with an error message.
Still in beta; just a quick bug-fix for dbmapreduce.
dbmapreduce now generates plausible output when given no rows of input.
Still in beta, but picking up some bug fixes.
dbmapreduce now generates plausible output when given no rows of input.
dbroweval the warnings option was backwards; now corrected. As a result, warnings in user code now default off (like in fsdb-1.x).
dbcolpercentile now defaults to assuming the target column is numeric. The new option -N
allows selection of a non-numeric target.
dbcolscorrelate now includes --sample
and --nosample
options to compute the sample or full population correlation coefficients. Thanks to Xue Cai for finding this bug.
Still in beta, but picking up some bug fixes.
html_table_to_db is now more aggressive about filling in empty cells with the official empty value, rather than leaving them blank or as whitespace.
dbpipeline now catches failures during pipeline element setup and exits reasonably gracefully.
dbsubprocess now reaps child processes, thus avoiding running out of processes when used a lot.
Finally, a full (non-beta) 2.x release!
Jdb has been renamed Fsdb, the flatfile-streaming database. This change affects all internal Perl APIs, but no shell command-level APIs. While Jdb served well for more than ten years, it is easily confused with the Java debugger (even though Jdb was there first!). It also is too generic to work well in web search engines. Finally, Jdb stands for ``John's database'', and we're a bit beyond that. (However, some call me the ``file-system guy'', so one could argue it retains that meaning.)
If you just used the shell commands, this change should not affect you. If you used the Perl-level libraries directly in your code, you should be able to rename "Jdb" to "Fsdb" to move to 2.12.
The jdb-announce list has not yet been renamed, but it will be shortly.
With this release I've accomplished everything I wanted to in fsdb-2.x. I therefore expect to return to boring, bugfix releases.
dbrowaccumulate now treats non-numeric data as zero by default.
Fixed a perl-5.10ism in dbmapreduce that breaks that program under 5.8. Thanks to Martin Lukac for reporting the bug.
Improved documentation for dbmapreduce's -f
option.
dbcolmovingstats now computes a moving standard deviation in addition to a moving mean.
Fix a make install bug reported by Shalindra Fernando.
Another minor release bug: on some systems programize_module loses executable permissions. Again reported by Shalindra Fernando.
Typo in the dbroweval manual fixed.
There is no longer a comment line to label columns in dbcolneaten, instead the header line is tweaked to line up. This change restores the Jdb-1.x behavior, and means that repeated runs of dbcolneaten no longer add comment lines each time.
It turns out dbcolneaten was not correctly handling trailing spaces when given the -E
option to suppress them. This regression is now fixed.
dbroweval(1) can now handle direct references to the last row via $lfref, a dubious but now documented feature.
Separators set with -C
in dbcolmerge and dbcolsplittocols were not properly setting the heading, and null fields were not recognized. The first bug was reported by Martin Lukac.
Documentation for Fsdb::IO::Reader has been improved.
The package should now be PGP-signed.
Internal improvements to debugging output and robustness of dbmapreduce and dbpipeline. TEST/dbpipeline_first_fails.cmd re-enabled.
Logging for dbmapreduce with code refs is now stable (it no longer includes a hex pointer to the code reference).
Better handling of mixed blank lines in Fsdb::IO::Reader (see test case dbcolize_blank_lines.cmd).
html_table_to_db now handles multi-line input better, and handles tables with COLSPAN.
dbpipeline now cleans up threads in an eval
to prevent "cannot detach a joined thread" errors that popped up in perl-5.10. Hopefully this prevents a race condition that causes the test suites to hang about 20% of the time (in dbpipeline_first_fails).
dbmapreduce now detects and correctly fails when the input and reducer have incompatible field separators.
dbcolstats, dbcolhisto, dbcolscorrelate, dbcolsregression, and dbrowcount now all take an -F
option to let one specify the output field separator (so they work better with dbmapreduce).
An omitted -k
from the manual page of dbmultistats is now there. Bug reported by Unkyu Park.
Fsdb::IO::Writer now no longer fails with -outputheader => never (an obscure bug).
Fsdb (in the warnings section) and dbcolstats now more carefully document how they handle (and do not handle) numerical precision problems, and other general limits. Thanks to Yuri Pradkin for prompting this documentation.
Fsdb::Support::fullname_to_sortkey
is now restored from Jdb
.
Documentation for multiple styles of input approaches (including performance description) added to Fsdb::IO.
dbmerge now correctly handles n-way merges. Bug reported by Yuri Pradkin.
dbcolneaten now defaults to not padding the last column.
dbrowenumerate now takes -N NewColumn to give the new column a name other than "count". Feature requested by Mike Rouch in January 2005.
New program dbcolcopylast copies the last value of a column into a new column copylast_column of the next row. New program requested by Fabio Silva; useful for converting dbmultistats output into dbrvstatdiff input.
Several tools (particularly dbmapreduce and dbmultistats) would report errors like "Unbalanced string table refcount: (1) for "STDOUT" during global destruction" on exit, at least on certain versions of Perl (for me on 5.10.1), but similar errors have been off-and-on for several Perl releases. Although I think my code looked OK, I worked around this problem with a different way of handling standard IO redirection.
Documentation to dbrvstatdiff was changed to use "sd" to refer to standard deviation, not "ss" (which might be confused with sum-of-squares).
The documentation for dbmultistats was missing the -k option in some cases.
dbmapreduce was failing on MacOS-10.6.3 for some tests with the error
dbmapreduce: cannot run external dbmapreduce reduce program (perl TEST/dbmapreduce_external_with_key.pl)
The problem seemed to be only in the error, not in operation. On MacOS, the error is now suppressed. Thanks to Alefiya Hussain for providing access to a Mac system that allowed debugging of this problem.
The csv_to_db command requires an external Perl library (Text::CSV_XS). On computers that lack this optional library, previously Fsdb would configure with a warning and then test cases would fail. Now those test cases are skipped with an additional warning.
The test suite now supports alternative valid output, as a hack to account for last-digit floating point differences. (Not very satisfying :-(
dbcolstats output for confidence intervals on very large datasets has changed. Previously it failed for more than 2^31-1 records, and handling of T-Distributions with thousands of rows was a bit dubious. Now datasets with more than 10000 are considered infinitely large and hopefully correctly handled.
The dbfilealter command had a --correct
option to work around incompatible field-separators, but it did nothing. Now it does the correct but sad, data-losing thing.
The dbmultistats command previously failed with an error message when invoked on input with a non-default field separator. The root cause was the underlying dbmapreduce that did not handle the case of reducers that generated output with a different field separator than the input. We now detect and repair incompatible field separators. This change corrects a problem originally documented and detected in Fsdb-2.20. Bug re-reported by Unkyu Park.
kitrace_to_db now supports a --utc option, which also fixes this test case for users outside of the Pacific time zone. Bug reported by David Graff, and also by Peter Desnoyers (within a week of each other :-)
xml_to_db can convert simple, very regular XML files into Fsdb.
dbfilepivot "pivots" a file, converting multiple rows corresponding to the same entity into a single row with multiple columns.
Bugs fixed in Fsdb::IO::Reader(3) manual page.
Fixed problems where dbcolstats was truncating floating point numbers when sorting. This strange behavior happens as of perl-5.14.2 and it seems like a Perl bug. I've worked around it for the test suites, but I'm a bit nervous.
csv_to_db now reports errors in CSV input with real diagnostics.
dbcolmovingstats can now compute median, when given the -m
option.
dbcolmovingstats non-numeric handling (the -a
option) now works properly.
The internal t/test_command.t test framework is now documented.
dbrowuniq now correctly handles the case where there is no input (previously it output a blank line, which is a malformed fsdb file). Thanks to Yuri Pradkin for reporting this bug.
Fixed a number of minor release problems (wrong permissions, old FSF address, etc.) found by rpmlint.
Tweaked the RPM spec.
Modified Makefile.PL to fail gracefully on Perl installations that lack threads. (Without this fix, I get massive failures in the non-ithreads test system.)
Removed unicode character in documentation of dbcolscorrelated so pod tests will pass. (Sigh, that should work :-( )
Fixed test suite failures on 5 tests (dbcolcreate_double_creation was the first) due to Carp's addition of a period. This problem was breaking Fsdb on perl-5.17. Thanks to Michael McQuaid for helping diagnose this problem.
The test suite now prints out the names of tests it tries.
Documentation fixes: typos in dbcolscorrelated, bugs in dbfilepivot, clarification for comment handling in Fsdb::IO::Reader.
Previously dbfilepivot assumed the input was grouped by keys and didn't verify that pre-condition. Now there is no pre-condition (it will sort the input by default), and it checks if the invariant is violated.
Previously dbfilepivot failed if the input had comments (oops :-); no longer.
Now dbrowuniq has the -L
option to preserve the last unique row (instead of the first), a common idiom.
New dbfilediff does fsdb-aware file differencing. It does not do smart intuition of add/removes like Unix diff(1), but it does know about columns, and with -E
, it does numeric-aware differences.
Test suites that are numeric now use dbfilediff to do numeric-aware comparisons, so the test suite should now be robust to slightly different computers and operating systems and compilers than exactly what I use.
dbfilediff and dbrowuniq now supports the -N
option to give the new column a different name. (And a test cases where this duplication mattered have been fixed.)
dbrvstatdiff now show the t-test breakpoint with a reasonable number of floating point digits.
Fixed a numerical stability problem in the dbroweval_last test case.
Documentation for dbjoin now includes resource requirements.
Default memory usage for dbsort is now about 256MB. (The world keeps moving forward.)
dbmerge now does merging in parallel. As a side-effect, dbsort should be faster when input overflows memory. The level of parallelism can be limited with the --parallelism
option. (There is more work to do here, but we're off to a start.)
Fsdb temporary files are now created more securely (with File::Temp).
Programs that sort or merge on fields (dbmerge2, dbmerge, dbsort, dbjoin) now report an error if no fields on which to join or merge are given.
Parallelism in dbmerge should now be more consistent, with less starting and stopping.
The new --xargs option lets one give input filenames on standard input, rather than the command line. This feature paves the way for faster dbsort for large inputs (by pipelining sorting and merging), expected in the next release.
A problem with --xargs is now fixed.
Configure now rejects Windows since tests seem to hang on some versions of Windows. (I would love help from a Windows developer to get this problem fixed, but I cannot do it.) See https://rt.cpan.org/Ticket/Display.html?id=84201.
All programs that use temporary files (dbcolpercentile, dbcolscorrelate, dbcolstats, dbcolstatscores) now take the -T
option and set the temporary directory consistently.
In addition, error messages are better when the temporary directory has problems. Problem reported by Liang Zhu.
dbmapreduce was failing with external, map-reduce aware reducers (when invoked with -M and an external program). (Sigh, did this case ever work?) This case should now work. Thanks to Yuri Pradkin for reporting this bug (in 2011).
Fixed perl-5.10 problem with dbmerge. Thanks to Yuri Pradkin for reporting this bug (in 2013).
Actually in 2.38, the Fedora .spec got cleaner dependencies. Suggestion from Christopher Meng via https://bugzilla.redhat.com/show_bug.cgi?id=877096.
Fsdb files are now explicitly set into UTF-8 encoding, unless one specifies -encoding
to Fsdb::IO
.
dbrowuniq now supports -I
for incremental counting.
dbsort now has more respect for a user-given temporary directory; it no longer is ignored for merging.
dbrowuniq now has options to output the first, last, and both first and last rows of a run (-F
, -L
, and -B
).
dbrowuniq now correctly handles -N
. Sigh, it didn't work before.
Documentation to dbrvstatdiff improved (inspired by questions from Qian Kun).
dbrowuniq no longer duplicates singleton unique lines when outputting both (with -B
).
Add missing XML::Simple
dependency to Makefile.PL.
Tests now show the diff of the failing output if run with make test TEST_VERBOSE=1
.
dbroweval now includes documentation for how to output extra rows. Suggestion from Yuri Pradkin.
Several improvements to the Fedora package from Michael Schwendt via https://bugzilla.redhat.com/show_bug.cgi?id=877096, and from the harsh master that is rpmlint. (I am stymied at teaching it that "outliers" is spelled correctly. Maybe I should send it Schneier's book. And an unresolvable invalid-spec-name lurks in the SRPM.)
Documentation for dbjoin improved to better describe memory usage. (Based on problem report by Lin Quan.)
The .spec is now perl-Fsdb.spec to satisfy rpmlint. Thanks to Christopher Meng for a specific bug report.
Test dbroweval_last.cmd no longer has a column that caused failures because of numerical instability.
Some tests now better handle bugs in old versions of perl (5.10, 5.12). Thanks to Calvin Ardi for help debugging this on a Mac with perl-5.12, but the fix should affect other platforms.
Changed the sort on TEST/dbsort_merge.cmd to strings (from numerics) so we're less susceptible to false test-failures due to floating point IO differences.
Yet more parallelism in dbmerge: new "endgame-mode" builds a merge tree of processes at the end of large merge tasks to get maximal parallelism. Currently this feature is off by default because it can hang for some inputs. Enable this experimental feature with --endgame
.
Fsdb::IO
now handles being given IO::Pipe
objects (as exercised by dbmerge).
Handling of NamedTmpfiles now supports concurrency. This fix will hopefully fix occasional "Use of uninitialized value $_ in string ne at ...NamedTmpfile.pm line 93." errors.
Fsdb now requires perl 5.10. This is a bug fix because some test cases used to require it, but this fact was not properly documented. (Back-porting to 5.008 would require removing all //
operators.)
Fsdb now handles automatic compression of file contents. Enable compression with dbfilealter -Z xz
(or gz
or bz2
). All programs should operate on compressed files and leave the output with the same level of compression. xz
is recommended as fastest and most efficient. gz
produces unrepeatable output (and so has no output test); it seems to insist on adding a timestamp.
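As a sketch (data.fsdb is a made-up file name), one might compress an existing file like this and then feed the compressed result directly to other Fsdb commands:
dbfilealter -Z xz < data.fsdb > data-xz.fsdb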
Fsdb is now thread free and only uses processes for parallelism. This change is a big change--the entire motivation for Fsdb-2 was to exploit parallelism via threading. Parallelism--good, but perl threading--bad for performance. Horribly bad for performance. About 20x worse than pipes on my box. (See perl bug #119445 for the discussion.)
Fsdb::Support::Freds
provides a thread-like abstraction over forking, with some nice support for callbacks in the parent upon child termination.
Details about removing threads: dbpipeline
is thread free, and new tests to verify each of its parts. The easy cases are dbcolpercentile
, dbcolstats
, dbfilepivot
, dbjoin
, and dbcolstatscores
, each of which use it in simple ways (2013-09-09). dbmerge
is now thread free (2013-09-13), but was a significant rewrite, which brought dbsort
along. dbmapreduce
is partly thread free (2013-09-21), again as a rewrite, and it brings dbmultistats
along. Full dbmapreduce
support took much longer (2013-10-02).
When running with user-only output (-n
), dbroweval now resets the output vector $ofref
after it has been output.
dbcolcreate will create all columns at the head of each row with the --first
option.
dbfilecat will concatenate two files, verifying that they have the same schema.
dbmapreduce now passes comments through, rather than eating them as before.
Also, dbmapreduce now supports a --
option to prevent misinterpreting sub-program parameters as for dbmapreduce.
dbmapreduce no longer figures out if it needs to add the key to the output. For multi-key-aware reducers, it never does (and cannot). For non-multi-key-aware reducers, it defaults to add the key and will now fail if the reducer adds the key (with error "dbcolcreate: attempt to create pre-existing column..."). In such cases, one must disable adding the key with the new option --no-prepend-key
.
dbmapreduce no longer copies the input field separator by default. For multi-key-aware reducers, it never does (and cannot). For non-multi-key-aware reducers, it defaults to not copying the field separator, but it will copy it (the old default) with the --copy-fs
option.
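For reference, a typical dbmapreduce invocation with dbcolstats as the reducer looks like this (a hedged sketch; experiments.fsdb and its experiment and duration columns are made up):
dbmapreduce -k experiment -- dbcolstats duration < experiments.fsdb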
Corrected a fast busy-wait in dbmerge.
Endgame mode enabled in dbmerge; it (and also large cases of dbsort) should now exploit greater parallelism.
Test case with Fsdb::BoundedQueue
(gone since 2.44) now removed.
Fixed some packaging details. (Really, threads are no longer required, missing tests in the MANIFEST.)
dbsort now better communicates with the merge process to avoid bursty parallelism.
Fsdb::IO::Writer now can take -autoflush => 1
for line-buffered IO.
Removed some stray "use threads" in some test cases. We didn't need them, and these were breaking non-threaded perls.
Better handling of Fred cleanup; should fix intermittent dbmapreduce failures on BSD.
Improved test framework to show output when tests fail. (This time, for real.)
Test suites now skip tests for libraries that are missing. (Patch for missing IO::Compresss:Xz
contributed by Calvin Ardi.)
Removed references to Jdb in the package specification. Since the name was changed in 2008, there's no longer a huge need for backwards compatibility. (Suggestion from Petr Šabata.)
Test suites now invoke the perl using the path from $Config{perlpath}
. Hopefully this helps testing in environments where there are multiple installed perls and the default perl is not the same as the perl-under-test (as happens in cpantesters.org).
Added specific encoding to this manpage to account for Unicode. Required to build correctly against perl-5.18.
Restored a line in the .spec to chmod g-s.
Unicode decoding is now handled correctly for programs that read from standard input. (Also: New test scripts cover unicode input and output.)
Fix to Fsdb documentation encoding line. Addresses test failure in perl-5.16 and earlier. (Who knew "encoding" had to be followed by a blank line.)
In dbroweval, the -N
(no output, even comments) option now implies -n
, and it now suppresses the header and trailer.
A few more tweaks to the perl-Fsdb.spec from Petr Šabata.
Fixed 3 uses of use v5.10
in test suites that were causing test failures (due to warnings, not real failures) on some platforms.
dbcolcreate now has a --no-recreate-fatal
that causes it to ignore creation of existing columns (instead of failing).
dbmapreduce once again is robust to reducers that output the key; --no-prepend-key
is no longer mandatory.
dbcolsplittorows can now enumerate the output rows with -E
.
dbcolmovingstats is more mathematically robust. Previously, for some inputs and some platforms, floating point rounding could sometimes cause square roots of negative numbers.
sqlselect_to_db converts the output of the MySQL or MariaDB select command into fsdb format.
dbfilediff now outputs the second row when doing sloppy numeric comparisons, to better support test suites.
Test suites changes to be robust to exact line numbers of failures, since different Perl releases fail on different lines. https://bugzilla.redhat.com/show_bug.cgi?id=1158380
dbfilediff now supports a --quiet option.
Better documentation of dbpipeline_filter.
Added groff-base and perl-podlators to the Fedora package spec. Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1163149. (Also in package 2.52-2.)
An important stability improvement to dbmapreduce. It, dbmultistats, and dbcolstats now support controlled parallelism with the --parallelism=N option. They default to running with the number of available CPUs. dbmapreduce also moderates its level of parallelism; previously it would create reducers as needed, causing CPU thrashing if reducers ran much slower than data production.
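For example, a sketch of capping parallelism at four workers (the column names experiment and duration are illustrative):
dbmultistats --parallelism=4 -k experiment duration < log.fsdb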
The combination of dbmapreduce with dbrowenumerate now works as it should. (The obscure bug was an interaction with dbcolcreate with non-multi-key reducers that output their own key. dbmapreduce has too many useful corner cases.)
Sigh, the test suite now has a test suite. Because, yes, I broke it, causing many incorrect failures at cpantesters. Now fixed.
dbfilediff now can be extra quiet, as I continue to try to track down a numeric difference on FreeBSD AMD boxes.
dbcolmovingstats gave different test output (just reflecting rounding error) when stddev approaches zero. We now detect and handle this case. See <https://rt.cpan.org/Public/Bug/Display.html?id=101220>, and thanks to H. Merijn Brand for the bug report.
Many, many spelling bugs found by H. Merijn Brand; thanks for the bug report.
A number of programs had misspelled "separator" as "seperator" in their --fieldseparator and --columnseparator options. These are now correctly spelled.
Internal argument parsing uses Getopt::Long, but mixed pass-through and <>. Bug reported by Petr Pisar at https://bugzilla.redhat.com/show_bug.cgi?id=1188538.
Added missing BuildRequires for XML::Simple.
dbfilecat now honors --remove-inputs
(previously it didn't). This omission meant that dbmapreduce (and dbmultistats) would accumulate files in /tmp when running. Bad news for inputs with 4M keys.
dbmultistats should be faster with lots of small keys. dbcolstats now supports -k to get some of the functionality of dbmultistats (if data is pre-sorted and median/quartiles are not required).
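As a sketch (column names are illustrative), when the data is already grouped by the key one might write:
dbsort experiment < log.fsdb | dbcolstats -k experiment duration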
Fixed a case where dbmerge suffered mojibake in endgame mode. This bug surfaced when dbsort was applied to large files (big enough to require merging) with unicode in them; the symptom was something like: Wide character in print at /usr/lib64/perl5/IO/Handle.pm line 420, <GEN12> line 111.
More IO is explicitly marked UTF-8 to avoid Perl's tendency to mojibake on otherwise valid unicode input. This change helps html_table_to_db.
dbcolscorrelate now cross-references dbcolsregression.
Documentation for dbrowdiff now clarifies that the default is baseline mode.
dbjoin now propagates -T into the sorting process (if it is required). Thanks to Lan Wei for reporting this bug.
dbjoin now supports hash joins with -t lefthash and -t righthash. Hash joins cache a table in memory, but do not require that the other table be sorted. They are ideal when joining a large table against a small one.
dbjoin now handles left and right outer joins with -t left and -t right.
dbjoin hash joins are now selected with -m lefthash and -m righthash (not the short-lived -t righthash option). (Technically this change is incompatible with Fsdb-2.60, but no one but me ever used that version.)
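For example, a hypothetical invocation (file and column names are made up, and this assumes lefthash caches the left, smaller table in memory):
dbjoin -m lefthash small_table.fsdb big_table.fsdb user_id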
Documentation for xml_to_db now includes sample output.
yaml_to_db converts a specific form of YAML to fsdb.
The test suite now uses diff -c -b rather than diff -cb to make OpenBSD-5.9 happier, I hope.
Comments that log operations at the end of each file now do simple quoting of spaces. (It is not guaranteed to be fully shell-compliant.)
There is a new standard option, --header, allowing one to specify an Fsdb header for inputs that lack one. Currently it is supported by dbcoldefine, dbrowuniq, dbmapreduce, dbmultistats, dbsort, and dbpipeline.
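For example, a sketch of sorting headerless data (column names are illustrative):
dbsort --header '#fsdb experiment duration' duration < raw_data.txt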
dbfilepivot now allows the --possible-pivots option, and if it is provided processes the data in one pass.
dbroweval logs are now quoted.
The option -j is now a synonym for --parallelism. (And several documentation bugs about this option are fixed.)
Additional support for --header in dbcolmerge, dbcol, dbrow, and dbroweval.
Version 2.62 was supposed to have this improvement, but did not (and now does): dbfilepivot now allows the --possible-pivots option, and if it is provided processes the data in one pass.
Version 2.62 was supposed to have this improvement, but did not (and now does): dbroweval logs are now quoted.
In dbroweval, the next row option previously did not correctly set up _last_fieldname. It now does.
The csv_to_db converter now has an optional -F x option to set the field separator.
Finally, dbcolsplittocols has a --header option, and a new -N option to give the list of resulting output columns.
dbcolstats and dbmultistats now produce no output (but a schema) when given no input but a schema. Previously they gave a null row of output. The --output-on-no-input and --no-output-on-no-input options can control this behavior.
dbmultistats and dbmapreduce now both take a -F x option to set the field separator.
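As a sketch, assuming t selects tab separation as in other Fsdb field-separator options (column names are illustrative):
dbmultistats -F t -k experiment duration < tab_separated.fsdb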
Fixed a missing use Carp in dbcolstats. Also went back and cleaned up all uses of croak(). Thanks to Zefram for the bug report.
Removed old tests from MANIFEST. (Thanks to Hang Guo for reporting this bug.)
Errors for non-existing input files now include the bad filename (before: "cannot setup filehandle", now: "cannot open input: cannot open TEST/bad_filename").
Hash joins with three identical rows were failing with the assertion failure "internal error: confused about overflow" due to a now-fixed bug.
dbformmail now has an "mh" mechanism that writes messages to individual files (an mh-style mailbox).
dbrow failed to include the Carp library, leading to failures on croak.
Fixed the dbjoin error message for an unsorted right stream; it incorrectly said "left".
All Fsdb programs can now read from and write to HDFS, when files that start with "hdfs:" are given to -i and -o options.
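For example, a sketch of reading input from HDFS (the path is illustrative):
dbcolstats -i hdfs:/user/example/results.fsdb duration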
The omitted-possible-error test case for dbfilepivot now has an alternative output that I saw on some BSD-running systems (thanks to CPAN).
dbmerge and dbmerge2 now support --header. dbmerge2 now gives better error messages when presented with the wrong number of inputs.
dbsort now works with --header even when the file is big (due to fixes to dbmerge).
csv_to_db now processes data with the "binary" option, allowing it to handle newlines embedded in quoted fields.
All programs will now transparently decompress input files given as filename arguments ending in a standard extension (.gz, .bz2, or .xz).
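For example, a sketch (the filename is illustrative) using the standard -i input option:
dbcolstats -i results.fsdb.gz duration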
Filled in the test case for autodecompress, which was missing from the 2.68 release.
The groff program is required for build, and Makefile.PL fails if groff is missing at build time. Thanks to Chris Williams for suggesting this check, and to the CPAN auto-building system for trying many platforms.
The dbcolstats program had a numerical instability that sometimes resulted in taking the square root of a negative number when many values varied right at the edge of floating-point precision. We now detect and report that case as 0 stddev. Thanks to Hang Guo for providing a test case.
dbcol can now take the option -a to include all columns, allowing reordering of certain columns while passing the rest through.
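A sketch (column names are illustrative), assuming the explicitly named columns come first and the remaining columns follow:
dbcol -a duration experiment < log.fsdb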
dbrowuniq and dbmerge now buffer comments so that the last row of data output is no longer buried in the final block of comments. (The data is identical, but for humans looking at the output, this change makes it less likely that the last row will be overlooked.)
dbmultistats and dbpipeline documentation now indicates that they support --header (something they have done since version 2.62 on 2016-11-29, but it is now documented).
dbcolcreate now supports --header.
Fixed several spelling errors in deprecated programs and removed information about the no-longer existing FreeBSD and MacOS ports. Thanks to Calvin Ardi for the patch.
dbmerge now handles --xargs when only one file is provided (and passes the file through unchanged). It also throws a clean error with --xargs if zero files are provided. (To support dbmerge, dbcol now has an internal --saveoutput option.) Thanks to Yuri Pradkin for reporting the unhandled corner case.
Suppressed a race condition in dbcolmerge that sometimes threw the error "Fsdb::Support::Freds: ending, but running process: dbmerge:xargs" on exit in the dbmerge_0_xargs test case.
dbcolhisto now handles the degenerate case where everything has the same value (previously it would throw "illegal division by zero").
The spec for Fedora now includes make as a BuildRequires, something required for Fedora 34.
dbcolpercentile now has a --weighted option.
The new Fsdb::Support::IPv6 package includes ipv6_normalize and ipv6_zeroize to rewrite IPv6 print addresses in IPv6 normal form, with a 0 in each 4-nybble field.
The Fsdb::Support::IPv6 package also includes ipv6_fullhex to rewrite IPv6 print addresses as full, 128-bit hex values.
Add optional type specifications to the schema. Types are not used in Perl, but are relevant in Python and Go Fsdb bindings. Types use a subset of perl pack specifiers: c, s, l, q are signed 8, 16, 32, and 64-bit integers, f is a float, d is double float, a is utf-8 string, and > and < can force big or little endianness. The default type for everything is "a", that is, utf-8 strings. Thanks to Wes Hardaker for pushing to get this long-desired feature out the door; his Python bindings need types.
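For example, a typed header might look like this (assuming the column:type suffix syntax; column names are illustrative):
#fsdb experiment:a size:l duration:d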
dbcol, dbcolcreate, dbcolcopylast, and dbcolrename now understand and propagate schema types. dbsort, dbjoin, dbmerge, dbmerge2, and dbfilepivot all take a new option -t to sort by type-inferred comparison, if a type is given.
dbcolstats, dbmultistats, and dbcolmovingstats now include type information in their output schema. (They assume input variables are floats, not integers.)
Even more IPv6: the functions in the Fsdb::Support::IPv6 package now support strings of hex digits as an alternate encoding for IP addresses (they are already the output of ipv6_fullhex), and ip_fullhex_to_normal converts full hex-encoded IPv4 or IPv6 addresses to their "normal" form (dotted-quad or IPv6 printable format).
The major version number is now 3.0 to correspond to the addition of types (although they were actually added in 2.75). Old fsdb files are supported (Fsdb-3.0 is backwards compatible with databases), but older versions will confuse types in new files (new Fsdb files are not forward compatible with old versions).
Type specifications in a few more programs: dbcolhisto, dbcolscorrelate, dbcolsregression, dbcolstatscores, dbrowaccumulate, dbrowcount, dbrowdiff, dbrvstatdiff.
dbcolhisto now puts an empty value on any empty rows.
dbcoltype redefines column types, or clears them with the -v option.
Type specifications in a few more programs that I missed: dbrowuniq, dbcolpercentile.
Minor documentation improvements.
dbcolsdecimate reduces density in timeseries data to make graphs with overly dense points visually similar but smaller.
yaml_to_db now flattens one level of arrays into comma-separated lists.
Clearer installation instructions.
dbcolsdecimate now takes either relative (-p) or absolute (-P) precision, and precision now affects only subsequent columns. Also, if absolute precisions are given for all columns, data is not buffered.
dbcolsdecimate now has examples in its documentation.
dbcolstats, dbmapreduce, dbcolpercentile, dbfilepivot, and dbmultistats now correctly propagate the temporary directory into the sort routine, if required. All of these programs sometimes require an internal sort, and previously they may have failed to use the correct tmpdir when it was set on the command line as an option. Thanks to Erica Stutz for noticing this bug.
dbrowdiff now has a --future option that compares incrementally with the next row rather than the previous one.
dbcolstats and dbcolscorrelate now optionally do weighted stats with the --weight option.
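A sketch (column names are illustrative), assuming --weight takes the name of the column holding each row's weight:
dbcolstats --weight count duration < histogram.fsdb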
dbcolscorrelate was not properly applying weighting.
dbrowdiff now has -A and -P options to set the output column names.
dbmultistats now supports weighted stats.
dbmerge now correctly handles the case when invoked with --xargs with exactly two input files. Thanks to Erica Stutz for reporting this error.
John Heidemann, johnh@isi.edu
See "Contributors" for the many people who have contributed bug reports and fixes.
Fsdb is Copyright (C) 1991-2024 by John Heidemann <johnh@isi.edu>.
This program is free software; you can redistribute it and/or modify it under the terms of version 2 of the GNU General Public License as published by the Free Software Foundation.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
A copy of the GNU General Public License can be found in the file ``COPYING''.
Any comments about these programs should be sent to John Heidemann, johnh@isi.edu.