This page describes the format of our Internet outage datasets.
We have three primary formats: outageraw, outages, and outagedownup.
We recommend using outagedownup data for most purposes, since it includes post-processing that cleans up some known flaws that occur in the raw data.
As of Oct. 2017, our sites (Vantage Points) are:
w: ISI-West in Los Angeles;
c: Colorado data from Ft. Collins;
j: Japan data from Keio University (SJF campus) near Tokyo;
e: ISI-East data from near Washington, DC;
g: Greek data from Athens University of Economics and Business;
n: Netherlands data from SurfNet.
Outage probing output is provided for each vantage point in “outageraw” format (see “Sites” above for details).
This dataset format includes information about every single ping, plus our evolving estimate of block responsiveness (the “A” value). We convert it into address accumulation datasets and merge multiple sites into the outages format (described below).
Each dataset includes the input to the prober, in several formats, and the output.
Output documents Trinocular “rounds”. Each round is a set of pings that conclude in a determination of block status, or, rarely, abort with an indeterminate status after 15 tries.
Output is in tab-separated text (FSDB format), with the following schema:
block: hex format of /24 IP block, with trailing zeros (A7 omits trailing zeros)
round_no: round number in this batch (will reset each time we restart)
round_start_epoch: when the round began, in seconds since 1970
a_short: the short term estimate of availability
a_oper: the operational estimate of A value (long term and reflecting variance)
status: status of this block: A12 and later: 0 for down, 1 for up, 2 for unknown
belief: our belief the block is down
n_pos: number of positive responses in this round
n_neg: number of negative responses in this round
probe_log: A base-64-encoded list of what specific addresses were probed. (Only in a18 and later).
rtt_us: estimated round-trip time in microseconds. (Only in a20 and later.)
A sample of raw data from dataset internet_outage_adaptive_a30w-20171006, file data/pinger-w4.e1507326545.a30w.2.r0.001.fsdb.bz2:
#fsdb -F t block round_no round_start_epoch a_short a_oper status belief n_pos n_neg probe_log rtt_us
58bae200 0 1507326545 0.8766 0.4383 1 0.01 1 0 CiQ= 219051
bd378a00 0 1507326545 0.3644 0.1822 1 0.01 1 0 CqQ= 204900
342e1a00 0 1507326545 0.2879 0.1439 1 0.01 1 0 CgQ= 158242
83c1c200 0 1507326545 0.1323 0.06614 1 0.01 1 0 CqU= 64556
d02afa00 0 1507326545 0.7865 0.3932 1 0.01 1 0 CuQ= 60168
...
The data shows the schema (the #fsdb line), followed by data for block 0x58bae200, which is 88.186.226.0/24, taken at 1507326545 seconds past the Unix epoch (2017-10-06T21:49:05Z).
The block was detected as up (status is 1), and the positive ping replied in 219.051 ms.
Other lines show other blocks, all probed at this time.
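The decoding steps described above (hex block to dotted-quad prefix, epoch seconds to UTC time, microseconds to milliseconds) can be sketched in a few lines of Python. This is an illustrative sketch, not part of the dataset tooling: the function name `parse_raw_line` and the choice of output fields are assumptions, and the column list assumes the a20-and-later schema with all eleven columns present.

```python
import ipaddress
from datetime import datetime, timezone

# Column names copied from the #fsdb header line of the sample above.
COLUMNS = ("block round_no round_start_epoch a_short a_oper status "
           "belief n_pos n_neg probe_log rtt_us").split()

# Status codes as documented for A12 and later.
STATUS_NAMES = {0: "down", 1: "up", 2: "unknown"}


def parse_raw_line(line):
    """Decode one outageraw data line (hypothetical helper, not dataset tooling)."""
    # split() with no argument accepts tabs as well as spaces.
    fields = dict(zip(COLUMNS, line.split()))
    # The block column is the /24 network address written as 8 hex digits.
    network = ipaddress.ip_network((int(fields["block"], 16), 24))
    start = datetime.fromtimestamp(int(fields["round_start_epoch"]),
                                   tz=timezone.utc)
    return {
        "block": str(network),
        "start": start.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "status": STATUS_NAMES[int(fields["status"])],
        "rtt_ms": int(fields["rtt_us"]) / 1000.0,  # microseconds to ms
    }


# First data line of the sample above.
record = parse_raw_line(
    "58bae200\t0\t1507326545\t0.8766\t0.4383\t1\t0.01\t1\t0\tCiQ=\t219051")
print(record)
```

The result for the first sample line matches the prose: block 88.186.226.0/24, up, with a 219.051 ms round-trip time. Header lines (those starting with `#`) would need to be skipped before calling the helper.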
“Outages” format data merges all observing sites for one time period (see “Sites” above for details; time periods are typically quarters).
Outages datasets show our best estimate from all observers, but they also reveal when observers disagree.
Output is in tab-separated text (FSDB format), with the following schema:
block: block address of the /24 in hex (with trailing zeros)
start: when the status takes effect (seconds since the Unix epoch)
duration: how long the status is in effect
uncertainty: our confidence in the precision of the start time. In non-raw data uncertainty is sometimes lowered when we merge observations from multiple observers.
precision_improvement: either unused (‘-’) or the precision improvement of the onset of a state change that results from merging data from multiple vantage points
status: which vantage points saw the outage. Each lowercase letter (‘c’, ‘j’, ‘w’, ‘g’) names one of our observing sites that saw an outage; the corresponding capital letter (‘W’, ‘C’, ‘J’, ‘G’) means that vantage point saw no outage. The order is fixed to [wW][cC][jJ][gG]. Datasets with six sites, such as the sample below, also show [eE][nN].
A sample of outages data from dataset internet_outage_adaptive_a30all-20171006, file a30all.outages.fsdb.bz2:
#fsdb -F t block start duration uncertainty status
01000400 1507326957 23 660 W
01000400 1507326980 890349 594 WCJGEN
...
01000400 1512522614 2242390 660 WJGEN
01000400 1514765004 839 242 WEN
01000400 1514765843 1082 331 E
01000500 1507326628 23 660 W
01000500 1507326651 1957 591 WCJGEN
01000500 1507328608 692 637 Wcjgen
...
These two segments show outages for two blocks. The first, block 0x01000400 (1.0.4.0/24), was up (capital letters in status), as detected by site W at time 1507326957 (2017-10-06T21:55:57Z), and as seen by all 6 sites in the next line, 23 seconds later.
The second block, 0x01000500 (1.0.5.0/24), was detected as up by site W at time 1507326628 (2017-10-06T21:50:28Z), followed by all the other sites 23 seconds later. However, at time 1507328608 (2017-10-06T22:23:28Z) all sites except for W failed to detect it as up.
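Reading the status column amounts to splitting its letters by case. The following minimal sketch shows this for the three status strings in the sample; `decode_status` is a hypothetical helper name, and the sketch makes no claim about sites whose letter is absent from the string, since the schema does not say how to read that case.

```python
def decode_status(status):
    """Split an outages status string by letter case (hypothetical helper).

    Per the schema: a capital letter means that site saw no outage,
    a lowercase letter means that site saw an outage.
    Returns (sites_seeing_up, sites_seeing_outage) as lowercase site letters.
    """
    up = [ch.lower() for ch in status if ch.isupper()]
    outage = [ch for ch in status if ch.islower()]
    return up, outage


# Status values taken from the sample above.
for status in ("W", "WCJGEN", "Wcjgen"):
    up, outage = decode_status(status)
    print(f"{status:7s} up={up} outage={outage}")
```

For `Wcjgen`, this yields site w seeing the block up and sites c, j, g, e, and n seeing an outage, which is the disagreement described above.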
“Outagesdownup” format data merges all observing sites for one time period (see “Sites” above for details; time periods are typically quarters). It also includes several post-processing steps: disagreement resolution, hole filling, and gone-dark detection.
Outagesdownup format represents our best estimate for outages at any given target block, using all the information we have.
Output is in tab-separated text (FSDB format), with the following schema:
block: block address of the /24 in hex (with trailing zeros)
start: when the status takes effect, in seconds since the Unix epoch.
duration: how long the status is in effect, in seconds.
uncertainty: our confidence in the precision of the start time. The true start time is sometime between start and start-uncertainty. The true duration is between duration-NextEventUncertainty and duration+ThisEventUncertainty. In non-raw data uncertainty is sometimes lowered when we merge observations from multiple observers.
downup: up (1), down (0), unmeasurable (-1, typically due to insufficient active observers), or gone dark (-2, typically out for more than 10 days)
Sample data, from dataset internet_outage_adaptive_a30all-20171006, file a30all.outagedownup.fsdb.bz2:
#fsdb -F t block start duration uncertainty downup
01000400 1507326957 7439968 331 1
01000500 1507326628 7439966 660 1
01000600 1507327123 7439309 333 1
01005000 1507326901 3269846 12540 1
01005000 1510596747 32046 7920 0
01005000 1510628793 47225 7920 1
01005000 1510676018 35681 11972 0
...
This data shows that blocks 0x01000400 (1.0.4.0/24), 0x01000500 (1.0.5.0/24), and 0x01000600 (1.0.6.0/24) were up (downup is 1) for the entire observation period (for the first block, starting at 1507326957, 2017-10-06T21:55:57Z, and continuing for 7439968 seconds, just more than 86 days).
Block 0x01005000 (1.0.80.0/24) was up starting at 1507326901 (2017-10-06T21:55:01Z) for 3269846 seconds (37.8 days), then down for 32046 seconds (8.9 hours), then up for 47225 seconds (13.1 hours), and so on.
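As a worked example of the schema, the sketch below parses the four sample lines for block 0x01005000, totals the time spent in each state, and applies the uncertainty rules stated in the schema (true start between start-uncertainty and start; true duration between duration-NextEventUncertainty and duration+ThisEventUncertainty). The helper names are assumptions made for illustration, and the totals cover only the lines shown, not the elided remainder of the file.

```python
from collections import defaultdict

# downup codes as documented in the schema above.
DOWNUP_NAMES = {1: "up", 0: "down", -1: "unmeasurable", -2: "gone dark"}

# Data lines for block 0x01005000, copied from the sample above.
SAMPLE = """\
01005000 1507326901 3269846 12540 1
01005000 1510596747 32046 7920 0
01005000 1510628793 47225 7920 1
01005000 1510676018 35681 11972 0
"""


def parse_downup(text):
    """Return (block, start, duration, uncertainty, downup) tuples."""
    records = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the #fsdb header
        block, start, duration, uncertainty, downup = line.split()
        records.append((block, int(start), int(duration),
                        int(uncertainty), int(downup)))
    return records


def seconds_by_state(records):
    """Total seconds per state; assumes all records are for one block."""
    totals = defaultdict(int)
    for _block, _start, duration, _unc, downup in records:
        totals[DOWNUP_NAMES[downup]] += duration
    return dict(totals)


def start_bounds(record):
    """Earliest and latest true start time, per the uncertainty column."""
    _block, start, _duration, uncertainty, _downup = record
    return start - uncertainty, start


def duration_bounds(this_record, next_record):
    """Shortest and longest true duration, using the next event's uncertainty."""
    duration, this_unc, next_unc = this_record[2], this_record[3], next_record[3]
    return duration - next_unc, duration + this_unc


records = parse_downup(SAMPLE)
totals = seconds_by_state(records)
first_outage, following = records[1], records[2]
print(totals)
print(start_bounds(first_outage))
print(duration_bounds(first_outage, following))
```

In these lines each record's start plus duration equals the next record's start, so the records tile the observation period without gaps.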
We have had several different versions of our outage data processing pipeline as we have learned more. In general, we have two goals in our datasets: to be as accurate as possible to what really happened, and to provide a long-term result for others to use.
These two goals are in conflict. To resolve that conflict, we sometimes update our datasets with recomputed results while preserving the old results as different files in the same database.
All datasets now include a “vX” tag that indicates the version.
Here is our summary:
Version | input | raw Trinocular (icmptrain, per-site data) | aXXall.vYY.outages.fsdb.bz2 (FBS+LABR, raw to outages, merge, precision improvement) | aXXall.vYY.outagedownup.fsdb.bz2 (disagreement resolution, hole filling, gone-dark) |
---|---|---|---|---|
v1 | target blocks: \|E(b)\| >= 15 and \|A(E(b))\| equal to 0.1, from Quan13c | a_oper: do not include down events to calculate a_oper; probing order: per-block probe order is randomized from full round (FR) to FR | survey edges: no pre-staging of before & after quarter data; hole filling (raw to outages): a single unknown state between 2 rounds of equal status has its status set to the same status as the other 2; precision improvement: forward precision improvement, from Quan13c section 4.5 | gone-dark: 1-week windows need 0.8 up time, otherwise set to -2, from Alwabel15a; multi-site resolution: any-up |
v1b | same | same | same | gone-dark: improvements in downup_to_unmeasurable.py code, window increase from 1 to 3 weeks |
v2 | same | same | raw to outages: unknowns are not fixed; precision improvement: backward precision improvement | gone-dark: same as in v1 but fixed some bugs |
v3 | same | same | same | internal only |
v4 | same | same | a_oper: a_oper as a new column in outages format (from Quan14c) | gone-dark: outages longer than 1 week set to -2; a_oper: adds a_oper as a new column in outagedownup format (from Quan14c) |
v5 | same | same | survey edges: added from 1 week before/after survey for proper gone-dark filtering, motivated by gone-dark in Alwabel15a; FBS: full block scanning over flaky blocks, outages in sparse blocks sometimes mapped to up, from Baltra19b; LABR: lone block recovery algorithm, single-address down events mapped to unknown, from Baltra19b | multi-site resolution: majority voting, from Baltra19a |
v5b | same | same | FBS: a_short bug fixed; FBS: full round (FR) completion (after a non-UP round): we are willing to count down probes in the first TR that includes a positive response against the FR accumulation; FBS: windowing, require 2 FRs due to round reordering, for data on or before 2019q4 | same |
v5c | target blocks: blocks with \|E(b)\| >= 3 | extra probe: send a 16th probe to the old known replier if the block changes state; probing order: no longer change order each round | FBS: windowing, FBS defaults to 1 FR (although when we run on datasets from 2019q4 or earlier, we need to manually override to 2 FR) | same |
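
Two of the steps named in the table are simple enough to illustrate: the v1 hole-filling rule (a single unknown between two rounds of equal status takes that status) and the two multi-site resolution policies (any-up in v1, majority voting in v5). The sketch below is a toy restatement of those rules as worded in the table, not the pipeline's code. The function names are hypothetical, the status codes are borrowed from the outageraw schema, and the tie-break in the majority vote is an assumption, since the table does not specify one.

```python
UP, DOWN, UNKNOWN = 1, 0, 2  # status codes from the outageraw schema


def fill_single_unknowns(statuses):
    """Hole filling as worded for v1: an isolated unknown between two
    rounds of equal status is set to that status."""
    filled = list(statuses)
    for i in range(1, len(statuses) - 1):
        before, after = statuses[i - 1], statuses[i + 1]
        # Compare against the original list so runs of unknowns stay unknown.
        if statuses[i] == UNKNOWN and before == after and before != UNKNOWN:
            filled[i] = before
    return filled


def resolve_any_up(site_states):
    """Any-up resolution (v1): the block is up if any site sees it up."""
    return UP if UP in site_states else DOWN


def resolve_majority(site_states):
    """Majority voting (v5): take the state most sites report.
    Ties go to UP here; that tie-break is an assumption of this sketch."""
    ups = sum(1 for state in site_states if state == UP)
    downs = sum(1 for state in site_states if state == DOWN)
    return UP if ups >= downs else DOWN


rounds = [UP, UNKNOWN, UP, DOWN, UNKNOWN, DOWN, UP, UNKNOWN, DOWN]
print(fill_single_unknowns(rounds))

sites = [DOWN, DOWN, UP]  # two sites see an outage, one sees the block up
print(resolve_any_up(sites), resolve_majority(sites))
```

The last two calls show why the policy change matters: with one of three sites seeing the block up, any-up reports the block as up while majority voting reports it as down.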