ArchiveBot Redis layout
The ArchiveBot pipelines and backend share a single Redis database. This document describes the keys in that database.
Keys do not follow any namespace-prefixing convention; ArchiveBot assumes it has full control over the database.
Pipelines connect directly to the Redis database, typically over SSH or spiped. The backend connects the same way. There is no access control from either side.
Keys of the form pipeline:PIPELINE_ID describe pipelines. PIPELINE_ID is a hexadecimal number generated by a pipeline process on startup. The pipeline process periodically updates its record while it runs. Each record is a hash with the following fields:
| Field | Type | Description |
| --- | --- | --- |
| disk_usage | Decimal | % of the pipeline’s filesystem in use |
| disk_available | Integer | Bytes available on the pipeline’s filesystem |
| fqdn | String | FQDN of the host running the pipeline |
| hostname | String | Short name of the host |
| id | String | The pipeline’s ID; always matches the hash key |
| load_average_1m | Decimal | Load average over the past minute |
| load_average_5m | Decimal | Load average over the past 5 minutes |
| load_average_15m | Decimal | Load average over the past 15 minutes |
| mem_available | Integer | Bytes of memory available on the host |
| mem_usage | Decimal | % of memory in use on the host |
| nickname | String | The pipeline’s nickname |
| pid | Integer | The PID of the pipeline process |
| python | String | The version of Python running the pipeline |
| ts | UNIX timestamp | The last time this pipeline record was updated |
| version | String | The pipeline’s version |
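As a sketch, a pipeline record can be read with redis-py; the connection details and the pipeline ID below are hypothetical:

```python
import redis

# Connection details are placeholders; in practice the database is
# reached over an SSH tunnel or spiped, as described above.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Hypothetical pipeline ID; real IDs are hex strings generated at startup.
record = r.hgetall("pipeline:0123456789abcdef")

print(record.get("nickname"), "on", record.get("fqdn"))
print("disk usage (%):", record.get("disk_usage"))
print("last updated (ts):", record.get("ts"))
```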
Keys named after a job’s ident hold job records; these are the most common record type in ArchiveBot’s database.
Parts of this record are frequently modified by both the backend and pipeline:
- whenever a response is recorded
- whenever job settings are changed
| Field | Type | Description |
| --- | --- | --- |
| bytes_downloaded | Integer | Bytes downloaded from the target site |
| concurrency | Integer | Current number of concurrent downloaders |
| death_timer | Integer | Number of liveness checks that have gone without a response |
| delay_max | Integer | Maximum delay between two requests on a downloader, in ms |
| delay_min | Integer | Minimum delay between two requests on a downloader, in ms |
| error_count | Integer | Number of error (i.e. 4xx, 5xx) responses encountered |
| fetch_depth | String | “shallow” for !ao jobs; “inf” for !a jobs |
| finished_at | UNIX ts w/ frac | When the job finished; not present while the job is running |
| grabber | String | “phantomjs” for PhantomJS jobs; omitted otherwise |
| heartbeat | Integer | Set by the pipeline; incremented once per heartbeat |
| ignore_patterns_set_key | String | The key storing this job’s ignore patterns |
| items_downloaded | Integer | Number of 2xx/3xx responses |
| items_queued | Integer | Number of URLs encountered in the job |
| last_acknowledged_heartbeat | Integer | Set by the backend; the last heartbeat value received |
| last_analyzed_log_entry | Integer | The last log entry index analyzed by the backend |
| last_broadcasted_log_entry | Integer | The last log entry index broadcasted over the firehose |
| last_trimmed_log_entry | Integer | The last log entry index trimmed by the log trimmer |
| log_key | String | The key storing this job’s log messages |
| log_score | Integer | The current log entry index |
| next_watermark | Integer | A threshold on the number of queued URLs; currently unused |
| no_phantomjs_smart_scroll | Boolean | If true, PhantomJS’ smart scrolling is not used |
| phantomjs_scroll | Integer | Maximum number of times to scroll the page with PhantomJS |
| phantomjs_wait | Integer | Maximum wait time between PhantomJS page interactions |
| pipeline_id | String | The pipeline running this job; corresponds to a pipeline:* key |
| queued_at | UNIX ts w/ frac | When this job was queued |
| r1xx | Integer | Number of 1xx responses |
| r2xx | Integer | Number of 2xx responses |
| r3xx | Integer | Number of 3xx responses |
| r4xx | Integer | Number of 4xx responses |
| r5xx | Integer | Number of 5xx responses |
| runk | Integer | Number of responses with an unknown HTTP status code |
| recorded_at | UNIX ts w/ frac | Deprecated. When this job was logged to ArchiveBot’s CouchDB |
| settings_age | Integer | Job settings version; incremented on each settings change |
| slug | String | WARC/JSON base filename |
| started_at | UNIX ts w/ frac | When this job was started by a pipeline |
| started_by | String | The user (typically an IRC nick) that submitted the job |
| started_in | String | Where the job was started (typically an IRC channel) |
| suppress_ignore_reports | Boolean | If true, ignore pattern matches are not reported |
| ts | UNIX ts w/ frac | Last update received from a pipeline for this job |
| url | String | The URL for this job: either the target URL or a URL file (for !ao < and !a <) |
| user_agent | String | The User-Agent to spoof; null if the default agent should be used |
The expected relationship between the last_analyzed_log_entry, last_broadcasted_log_entry, and last_trimmed_log_entry values is:

last_analyzed_log_entry <= last_broadcasted_log_entry <= last_trimmed_log_entry

A slug usually looks like “twitter.com-inf”. The date, time, WARC sequence number, extension, etc. are all appended to it by the pipeline.
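As a minimal sketch, assuming job records are hashes keyed by the job’s ident (matching the hash-key convention of pipeline records), a record can be inspected like this; the ident is hypothetical:

```python
import redis

r = redis.Redis(decode_responses=True)

ident = "6syyj8o8mmm2231z3separk19"  # hypothetical job ident

job = r.hgetall(ident)
print("URL:", job.get("url"))
print("bytes downloaded:", job.get("bytes_downloaded"))

# Response breakdown using the per-status-class counters.
for field in ("r1xx", "r2xx", "r3xx", "r4xx", "r5xx", "runk"):
    print(field, job.get(field, "0"))
```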
The key named by a job’s ignore_patterns_set_key field is a set holding the ignore patterns for that job. Each ignore pattern is a Python regex.
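For illustration, a sketch that loads a job’s ignore patterns and tests a URL against them; the ident is again hypothetical:

```python
import re
import redis

r = redis.Redis(decode_responses=True)

ident = "6syyj8o8mmm2231z3separk19"  # hypothetical job ident

# The job record names the set that holds its ignore patterns.
set_key = r.hget(ident, "ignore_patterns_set_key")

# Each member of the set is a Python regular expression.
patterns = [re.compile(p) for p in r.smembers(set_key)]

def first_ignore_match(url):
    """Return the first pattern that matches url, or None."""
    for pattern in patterns:
        if pattern.search(url):
            return pattern.pattern
    return None

print(first_ignore_match("http://example.com/calendar?year=1998"))
```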
Log entries generated for a job by the wpull hooks or by pipeline stdout capture are sent to the key named by the job’s log_key field. The backend is notified of new entries when the pipeline publishes the job ident on the update channel described below.
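A sketch of reading the unanalyzed portion of a job’s log. It assumes the log key is a sorted set whose scores are the entry indices tracked by log_score and the last_*_log_entry fields; that layout is an assumption, not something this document confirms:

```python
import redis

r = redis.Redis(decode_responses=True)

ident = "6syyj8o8mmm2231z3separk19"  # hypothetical job ident
job = r.hgetall(ident)

# Entries after the last one the backend analyzed, up to the current index.
start = int(job.get("last_analyzed_log_entry", 0)) + 1
end = int(job.get("log_score", 0))

# Assumes the log key is a sorted set scored by entry index.
for entry in r.zrangebyscore(job["log_key"], start, end):
    print(entry)
```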
Deprecated. This list contains pipeline names, and is still modified by pipelines, but no pipeline listing uses it.
These keys store counts of completed, aborted, and failed jobs, respectively.
A completed job is a job that made it through the entire ArchiveBot pipeline.
An aborted job is a job that was terminated using the !abort command.
A failed job is a job that crashed and was reaped using the internal console.
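Reading the counters is a single GET each. The key names in this sketch (jobs_completed, jobs_aborted, jobs_failed) are assumptions; this document does not spell them out:

```python
import redis

r = redis.Redis(decode_responses=True)

# Key names here are assumptions; substitute the actual counter keys.
for key in ("jobs_completed", "jobs_aborted", "jobs_failed"):
    print(key, r.get(key) or 0)
```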
These are used by ArchiveBot’s Twitter tweeter. They store tweets that have already been posted and tweets in the to-post queue, respectively.
Whenever a pipeline has new log entries for a job, it publishes that job’s ident to this channel.
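A sketch of a consumer of this channel, assuming ordinary Redis pub/sub; the channel name updates is an assumption:

```python
import redis

r = redis.Redis(decode_responses=True)

pubsub = r.pubsub()
pubsub.subscribe("updates")  # channel name is an assumption

for message in pubsub.listen():
    if message["type"] != "message":
        continue  # skip subscribe confirmations
    ident = message["data"]  # the job ident that has new log entries
    job = r.hgetall(ident)
    print("new log entries for", job.get("url", ident))
```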
There exists one of these channels per job.
When settings are updated for that job, the new settings age is published via this channel. The job’s settings listener receives the new version. If the new version is greater than the current version, the new settings are read from Redis and applied.
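A sketch of the settings listener described above, assuming the per-job channel is named after the job’s ident (an assumption) and that the settings live in the job hash:

```python
import redis

r = redis.Redis(decode_responses=True)

ident = "6syyj8o8mmm2231z3separk19"  # hypothetical job ident
current_age = int(r.hget(ident, "settings_age") or 0)

pubsub = r.pubsub()
pubsub.subscribe(ident)  # per-job channel name is an assumption

for message in pubsub.listen():
    if message["type"] != "message":
        continue
    new_age = int(message["data"])  # the published settings age
    if new_age > current_age:
        # Settings changed: re-read the relevant fields from the job hash.
        delay_min, delay_max = r.hmget(ident, "delay_min", "delay_max")
        current_age = new_age
        print("applied settings version", new_age, delay_min, delay_max)
```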