ArchiveBot Redis layout

The ArchiveBot pipelines and backend share a single Redis database. This document describes the keys in that database.

Keys do not follow any namespace-prefixing convention; ArchiveBot assumes it has full control over the database.

Connection

Pipelines connect directly to the Redis database, typically through an SSH tunnel or spiped. The backend connects the same way. There is no access control on either side.
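
A minimal sketch of such a connection, assuming redis-py and an SSH tunnel already forwarding the remote Redis port to localhost (the host, port, and database number here are illustrative, not ArchiveBot’s actual configuration):

    # Forward the backend's Redis port to the local machine first, e.g.:
    #   ssh -N -L 6379:localhost:6379 user@backend-host
    import redis

    # decode_responses=True returns str instead of bytes
    r = redis.Redis(host='localhost', port=6379, db=0, decode_responses=True)
    r.ping()  # raises ConnectionError if the tunnel or server is down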

pipeline:PIPELINE_ID

Type: hash

Keys matching this form describe pipelines. PIPELINE_ID is a hexadecimal number generated by the pipeline process at startup. The pipeline process periodically updates this record while it runs.

Hash keys

Key               Intended type   Meaning
disk_usage        Decimal         % of the pipeline’s filesystem in use
disk_available    Integer         Bytes available on the pipeline’s filesystem
fqdn              String          FQDN of the host running the pipeline
hostname          String          Short name of the host
id                String          The pipeline’s ID; always matches the hash key
load_average_1m   Decimal         Load average over the past minute
load_average_5m   Decimal         Load average over the past 5 minutes
load_average_15m  Decimal         Load average over the past 15 minutes
mem_available     Integer         Bytes of memory available on the host
mem_usage         Decimal         % of memory in use on the host
nickname          String          The pipeline’s nickname
pid               Integer         The PID of the pipeline process
python            String          The version of Python running the pipeline
ts                UNIX timestamp  The last time this pipeline record was updated
version           String          The pipeline’s version
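
As an illustration, a monitoring script could enumerate these records with redis-py like so (the field names come from the table above; everything else is an assumption):

    import redis

    r = redis.Redis(decode_responses=True)

    # SCAN avoids blocking the server the way KEYS would
    for key in r.scan_iter(match='pipeline:*'):
        p = r.hgetall(key)
        print(p['nickname'], p['fqdn'], 'load:', p['load_average_1m'])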

IDENT (i.e. [a-z0-9]{25,})

Type: hash

These are job records, the most common record type in ArchiveBot’s database.

Parts of each record are frequently modified by both the backend and the pipeline (see the sketch after this list):

  • whenever a response is recorded
  • whenever job settings are changed
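
As an illustration of the first case, here is a hedged redis-py sketch of recording a response. The field names come from the table below, but the use of a MULTI/EXEC transaction is an assumption:

    import redis

    r = redis.Redis(decode_responses=True)

    def record_response(ident, status_code, body_bytes):
        # rNxx bucket for known status classes, runk otherwise
        bucket = 'r%dxx' % (status_code // 100) if 100 <= status_code < 600 else 'runk'
        with r.pipeline() as tx:  # a Redis MULTI, not an ArchiveBot pipeline
            tx.hincrby(ident, bucket)
            tx.hincrby(ident, 'bytes_downloaded', body_bytes)
            if 200 <= status_code < 400:
                tx.hincrby(ident, 'items_downloaded')
            elif status_code >= 400:
                tx.hincrby(ident, 'error_count')
            tx.execute()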

Hash keys

Key                          Intended type    Meaning
bytes_downloaded             Integer          Bytes downloaded from the target site
concurrency                  Integer          Current number of concurrent downloaders
death_timer                  Integer          Number of liveness checks that have gone without a response
delay_max                    Integer          Maximum delay between two requests on a downloader, in ms
delay_min                    Integer          Minimum delay between two requests on a downloader, in ms
error_count                  Integer          Number of error (i.e. 4xx, 5xx) responses encountered
fetch_depth                  String           “shallow” for !ao jobs; “inf” for !a jobs
finished_at                  UNIX ts w/ frac  When the job finished; not present if the job is running
heartbeat                    Integer          Set by the pipeline; incremented once per heartbeat
ignore_patterns_set_key      String           The key storing this job’s ignore patterns
items_downloaded             Integer          Number of 2xx/3xx responses
items_queued                 Integer          Number of URLs encountered in the job
last_acknowledged_heartbeat  Integer          Set by the backend; the last heartbeat received
last_analyzed_log_entry      Integer          The last log entry index analyzed by the backend [1]
last_broadcasted_log_entry   Integer          The last log entry index broadcasted over the firehose [1]
last_trimmed_log_entry       Integer          The last log entry index trimmed by the log trimmer [1]
log_key                      String           The key storing this job’s log messages
log_score                    Integer          The current log entry index
next_watermark               Integer          A threshold for the number of queued URLs; currently unused
pipeline_id                  String           The pipeline running this job; corresponds to a pipeline:* key
queued_at                    UNIX ts w/ frac  When this job was queued
r1xx                         Integer          Number of 1xx responses
r2xx                         Integer          Number of 2xx responses
r3xx                         Integer          Number of 3xx responses
r4xx                         Integer          Number of 4xx responses
r5xx                         Integer          Number of 5xx responses
runk                         Integer          Number of responses with an unknown HTTP status code
recorded_at                  UNIX ts w/ frac  Deprecated. When this job was logged to ArchiveBot’s CouchDB
settings_age                 Integer          Job settings version; incremented on each settings change
slug                         String           WARC/JSON base filename [2]
started_at                   UNIX ts w/ frac  When this job was started by a pipeline
started_by                   String           The user (typically an IRC nick) that submitted the job
started_in                   String           Where the job was started (typically an IRC channel)
suppress_ignore_reports      Boolean          Whether ignore pattern matches should be reported
ts                           UNIX ts w/ frac  Last update received from a pipeline for this job
url                          String           The URL for this job: either the target or a URL file (for !ao < and !a <)
user_agent                   String           The user agent to spoof; null if the default agent should be used

[1]: The expected relationship between these values is

last_analyzed_log_entry <= last_broadcasted_log_entry <= last_trimmed_log_entry

[2]: Usually looks like “twitter.com-inf”. The date, time, WARC sequence, extension, etc. are all appended by the pipeline.

IDENT_ignores

Type: set

Ignore patterns for the job identified by IDENT. Each pattern is a Python regular expression.
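
A hedged sketch of how a consumer might apply these patterns (whether matching uses re.search or re.match is an assumption):

    import re
    import redis

    r = redis.Redis(decode_responses=True)

    def is_ignored(ident, url):
        # SMEMBERS fetches every pattern in the IDENT_ignores set
        return any(re.search(p, url) for p in r.smembers(ident + '_ignores'))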

IDENT_log

Type: zset

Log entries generated for a job by the wpull hooks or pipeline stdout capture are sent here. The backend is notified of new entries in this set when the pipeline publishes the job ident on the updates channel.
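
A hedged sketch of the write side, assuming log_score supplies the zset score for each entry and that entries are JSON blobs (the payload shape is invented for illustration):

    import json
    import redis

    r = redis.Redis(decode_responses=True)

    def append_log_entry(ident, entry):
        score = r.hincrby(ident, 'log_score')        # next log entry index
        r.zadd(ident + '_log', {json.dumps(entry): score})
        r.publish('updates', ident)                  # notify the backend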

pipelines

Type: list

Deprecated. This list contains pipeline names, and is still modified by pipelines, but no pipeline listing uses it.

jobs_completed, jobs_aborted, jobs_failed

Type: string

These keys store counts of completed, aborted, and failed jobs, respectively.

A completed job is a job that made it through the entire ArchiveBot pipeline. An aborted job is a job that was terminated using !abort. A failed job is a job that crashed and was reaped using the internal console.
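
Because these are plain string counters, INCR and MGET cover both sides; a minimal redis-py sketch (missing keys come back as None):

    import redis

    r = redis.Redis(decode_responses=True)
    r.incr('jobs_completed')  # e.g. when a job finishes cleanly
    completed, aborted, failed = r.mget('jobs_completed', 'jobs_aborted', 'jobs_failed')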

tweets:done, tweets:queue

Type: zset

These are used by ArchiveBot’s Twitter tweeter. They store tweets that have already been posted and tweets waiting to be posted, respectively.
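
A hedged sketch of moving a tweet from the queue to the done set; scoring members by UNIX timestamp is an assumption, not necessarily what the tweeter does:

    import time
    import redis

    r = redis.Redis(decode_responses=True)

    def mark_tweeted(text):
        r.zrem('tweets:queue', text)
        r.zadd('tweets:done', {text: time.time()})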

Pubsub channels

updates

Whenever a pipeline has new log entries for a job, it publishes that job’s ident to this channel.
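
A hedged sketch of a subscriber on the backend side; fetching entries above the last analyzed index with ZRANGEBYSCORE is an assumption based on the job-record fields described above:

    import redis

    r = redis.Redis(decode_responses=True)
    ps = r.pubsub(ignore_subscribe_messages=True)
    ps.subscribe('updates')

    for msg in ps.listen():
        ident = msg['data']
        last = int(r.hget(ident, 'last_analyzed_log_entry') or 0)
        entries = r.zrangebyscore(ident + '_log', last + 1, '+inf')
        # ... analyze entries, then advance last_analyzed_log_entry ...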

archivebot:job:IDENT

One such channel exists per job.

When the job’s settings are updated, the new settings age is published on this channel. The job’s settings listener compares it against the version it is currently running; if the published version is greater, the new settings are read from Redis and applied.
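
A hedged sketch of such a listener, assuming the published message body is the new settings_age (the specific fields re-read here are illustrative):

    import redis

    r = redis.Redis(decode_responses=True)

    def watch_settings(ident, current_age):
        ps = r.pubsub(ignore_subscribe_messages=True)
        ps.subscribe('archivebot:job:' + ident)
        for msg in ps.listen():
            new_age = int(msg['data'])
            if new_age > current_age:
                delay_min, delay_max, concurrency = r.hmget(
                    ident, 'delay_min', 'delay_max', 'concurrency')
                current_age = new_age
                # ... apply the re-read settings to the running job ...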