ArchiveBot Redis layout
The ArchiveBot pipelines and backend share a single Redis database. This document describes the keys in that database.
Keys do not follow any namespace-prefixing convention; ArchiveBot assumes it has full control over the database.
Pipelines connect directly to the Redis database, typically over SSH or spiped. The backend connects the same way. There is no access control from either side.
Keys of the form pipeline:PIPELINE_ID describe pipelines. PIPELINE_ID is a hexadecimal number generated by a pipeline process on startup. The pipeline process periodically updates its record while it runs. Each record is a hash with the following fields:
| Field | Type | Description |
| --- | --- | --- |
| disk_usage | Decimal | % of the pipeline’s filesystem in use |
| disk_available | Integer | Bytes available on the pipeline’s filesystem |
| fqdn | String | FQDN of the host running the pipeline |
| hostname | String | Short name of the host |
| id | String | The pipeline’s ID; always matches the hash key |
| load_average_1m | Decimal | Load average over the past minute |
| load_average_5m | Decimal | Load average over the past 5 minutes |
| load_average_15m | Decimal | Load average over the past 15 minutes |
| mem_available | Integer | Bytes of memory available on the host |
| mem_usage | Decimal | % of memory in use on the host |
| nickname | String | The pipeline’s nickname |
| pid | Integer | The PID of the pipeline process |
| python | String | The version of Python running the pipeline |
| ts | UNIX timestamp | The last time this pipeline record was updated |
| version | String | The pipeline’s version |
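As a sketch, a pipeline record can be read with redis-py; the connection details and the pipeline ID below are hypothetical:

```python
import redis

# Connection details are placeholders; in practice the database is
# reached over an SSH tunnel or spiped, as described above.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Hypothetical pipeline ID; real IDs are hex strings generated at startup.
record = r.hgetall("pipeline:0123456789abcdef")

print(record.get("nickname"), "on", record.get("fqdn"))
print("disk usage (%):", record.get("disk_usage"))
print("last updated (ts):", record.get("ts"))
```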
Keys named after a job’s ident hold job records; these are the most common record type in ArchiveBot’s database.
Parts of this record are frequently modified by both the backend and pipeline:
- whenever a response is recorded
- whenever job settings are changed
| Field | Type | Description |
| --- | --- | --- |
| bytes_downloaded | Integer | Bytes downloaded from the target site |
| concurrency | Integer | Current number of concurrent downloaders |
| death_timer | Integer | Number of liveness checks that have gone without a response |
| delay_max | Integer | Maximum delay between two requests on a downloader, in ms |
| delay_min | Integer | Minimum delay between two requests on a downloader, in ms |
| error_count | Integer | Number of error (i.e. 4xx, 5xx) responses encountered |
| fetch_depth | String | “shallow” for !ao jobs; “inf” for !a jobs |
| finished_at | UNIX ts w/ frac | When the job finished; not present while the job is running |
| grabber | String | “phantomjs” for PhantomJS jobs; omitted otherwise |
| heartbeat | Integer | Set by the pipeline; incremented once per heartbeat |
| ignore_patterns_set_key | String | The key storing this job’s ignore patterns |
| items_downloaded | Integer | Number of 2xx/3xx responses |
| items_queued | Integer | Number of URLs encountered in the job |
| last_acknowledged_heartbeat | Integer | Set by the backend; the last heartbeat value received |
| last_analyzed_log_entry | Integer | The last log entry index analyzed by the backend |
| last_broadcasted_log_entry | Integer | The last log entry index broadcasted over the firehose |
| last_trimmed_log_entry | Integer | The last log entry index trimmed by the log trimmer |
| log_key | String | The key storing this job’s log messages |
| log_score | Integer | The current log entry index |
| next_watermark | Integer | A threshold on the number of queued URLs; currently unused |
| no_phantomjs_smart_scroll | Boolean | If true, PhantomJS’ smart scrolling is not used |
| phantomjs_scroll | Integer | Maximum number of times to scroll the page with PhantomJS |
| phantomjs_wait | Integer | Maximum wait time between PhantomJS page interactions |
| pipeline_id | String | The pipeline running this job; corresponds to a pipeline:* key |
| queued_at | UNIX ts w/ frac | When this job was queued |
| r1xx | Integer | Number of 1xx responses |
| r2xx | Integer | Number of 2xx responses |
| r3xx | Integer | Number of 3xx responses |
| r4xx | Integer | Number of 4xx responses |
| r5xx | Integer | Number of 5xx responses |
| runk | Integer | Number of responses with an unknown HTTP status code |
| recorded_at | UNIX ts w/ frac | Deprecated. When this job was logged to ArchiveBot’s CouchDB |
| settings_age | Integer | Job settings version; incremented on each settings change |
| slug | String | WARC/JSON base filename |
| started_at | UNIX ts w/ frac | When this job was started by a pipeline |
| started_by | String | The user (typically an IRC nick) that submitted the job |
| started_in | String | Where the job was started (typically an IRC channel) |
| suppress_ignore_reports | Boolean | If true, ignore pattern matches are not reported |
| ts | UNIX ts w/ frac | Last update received from a pipeline for this job |
| url | String | The URL for this job: either the target URL or a URL file (for !ao < and !a <) |
| user_agent | String | The User-Agent to spoof; null if the default agent should be used |
The expected relationship between the last_analyzed_log_entry, last_broadcasted_log_entry, and last_trimmed_log_entry values is:

last_analyzed_log_entry <= last_broadcasted_log_entry <= last_trimmed_log_entry

A slug usually looks like “twitter.com-inf”. The date, time, WARC sequence number, extension, etc. are all appended to it by the pipeline.
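As a minimal sketch, assuming job records are hashes keyed by the job’s ident (matching the hash-key convention of pipeline records), a record can be inspected like this; the ident is hypothetical:

```python
import redis

r = redis.Redis(decode_responses=True)

ident = "6syyj8o8mmm2231z3separk19"  # hypothetical job ident

job = r.hgetall(ident)
print("URL:", job.get("url"))
print("bytes downloaded:", job.get("bytes_downloaded"))

# Response breakdown using the per-status-class counters.
for field in ("r1xx", "r2xx", "r3xx", "r4xx", "r5xx", "runk"):
    print(field, job.get(field, "0"))
```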
The key named by a job’s ignore_patterns_set_key field is a set holding the ignore patterns for that job. Each ignore pattern is a Python regex.
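For illustration, a sketch that loads a job’s ignore patterns and tests a URL against them; the ident is again hypothetical:

```python
import re
import redis

r = redis.Redis(decode_responses=True)

ident = "6syyj8o8mmm2231z3separk19"  # hypothetical job ident

# The job record names the set that holds its ignore patterns.
set_key = r.hget(ident, "ignore_patterns_set_key")

# Each member of the set is a Python regular expression.
patterns = [re.compile(p) for p in r.smembers(set_key)]

def first_ignore_match(url):
    """Return the first pattern that matches url, or None."""
    for pattern in patterns:
        if pattern.search(url):
            return pattern.pattern
    return None

print(first_ignore_match("http://example.com/calendar?year=1998"))
```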
Log entries generated for a job by the wpull hooks or by pipeline stdout capture are sent to the key named by the job’s log_key field. The backend is notified of new entries when the pipeline publishes the job ident on the update channel described below.
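A sketch of reading the unanalyzed portion of a job’s log. It assumes the log key is a sorted set whose scores are the entry indices tracked by log_score and the last_*_log_entry fields; that layout is an assumption, not something this document confirms:

```python
import redis

r = redis.Redis(decode_responses=True)

ident = "6syyj8o8mmm2231z3separk19"  # hypothetical job ident
job = r.hgetall(ident)

# Entries after the last one the backend analyzed, up to the current index.
start = int(job.get("last_analyzed_log_entry", 0)) + 1
end = int(job.get("log_score", 0))

# Assumes the log key is a sorted set scored by entry index.
for entry in r.zrangebyscore(job["log_key"], start, end):
    print(entry)
```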
Deprecated. This list contains pipeline names, and is still modified by pipelines, but no pipeline listing uses it.
These keys store counts of completed, aborted, and failed jobs, respectively.
A completed job is a job that made it through the entire ArchiveBot pipeline.
An aborted job is a job that was terminated using the !abort command.
A failed job is a job that crashed and was reaped using the internal console.
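Reading the counters is a single GET each. The key names in this sketch (jobs_completed, jobs_aborted, jobs_failed) are assumptions; this document does not spell them out:

```python
import redis

r = redis.Redis(decode_responses=True)

# Key names here are assumptions; substitute the actual counter keys.
for key in ("jobs_completed", "jobs_aborted", "jobs_failed"):
    print(key, r.get(key) or 0)
```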
These are used by ArchiveBot’s Twitter tweeter. They store tweets that have already been posted and tweets in the to-post queue, respectively.
Whenever a pipeline has new log entries for a job, it publishes that job’s ident to this channel.
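A sketch of a consumer of this channel, assuming ordinary Redis pub/sub; the channel name updates is an assumption:

```python
import redis

r = redis.Redis(decode_responses=True)

pubsub = r.pubsub()
pubsub.subscribe("updates")  # channel name is an assumption

for message in pubsub.listen():
    if message["type"] != "message":
        continue  # skip subscribe confirmations
    ident = message["data"]  # the job ident that has new log entries
    job = r.hgetall(ident)
    print("new log entries for", job.get("url", ident))
```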
There exists one of these channels per job.
When settings are updated for that job, the new settings age is published via this channel. The job’s settings listener receives the new version. If the new version is greater than the current version, the new settings are read from Redis and applied.
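A sketch of the settings listener described above, assuming the per-job channel is named after the job’s ident (an assumption) and that the settings live in the job hash:

```python
import redis

r = redis.Redis(decode_responses=True)

ident = "6syyj8o8mmm2231z3separk19"  # hypothetical job ident
current_age = int(r.hget(ident, "settings_age") or 0)

pubsub = r.pubsub()
pubsub.subscribe(ident)  # per-job channel name is an assumption

for message in pubsub.listen():
    if message["type"] != "message":
        continue
    new_age = int(message["data"])  # the published settings age
    if new_age > current_age:
        # Settings changed: re-read the relevant fields from the job hash.
        delay_min, delay_max = r.hmget(ident, "delay_min", "delay_max")
        current_age = new_age
        print("applied settings version", new_age, delay_min, delay_max)
```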