MongoDB Implementation of FetchQueue Interface for Simplecrawler
This is an implementation of the simplecrawler FetchQueue interface that uses MongoDB as the queue backend.
Features:
- A running job can be paused, stopped, killed, or terminated without losing the queue state.
- One crawler job can be run in parallel by several crawler instances sharing one queue, and instances can be added or removed at runtime.
Installation
npm install --save mongodb-simplecrawler-queue
Usage
All you need is a connection configuration: the URL of the MongoDB instance, the database name, and the collection name.
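    const { MongoDbQueue } = require('mongodb-simplecrawler-queue'); // or import { MongoDbQueue } from 'mongodb-simplecrawler-queue';
    const Crawler = require('simplecrawler');

    const connectionConfig = {
      url: 'mongodb://localhost:27017',
      dbName: 'crawler',
      collectionName: 'queue',
    };

    // Assumed: the queue is constructed from the connection configuration.
    const crawlerQueue = new MongoDbQueue(connectionConfig);
    // Initializing the queue and attaching it to a Crawler instance are not shown here.
    crawlerQueue;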
Monitoring and garbage collector
The queue can run monitoring tasks and garbage collector tasks periodically. Both are enabled through the queue configuration:
    const connectionConfig = {
      url: 'mongodb://localhost:27017',
      dbName: 'crawler',
      collectionName: 'queue',

      // Garbage collector configuration.
      GCConfig: {
        run: true,
        msInterval: 1000 * 60 * 5,
      },

      // Monitoring configuration.
      monitorConfig: {
        run: true,
        statisticCollectionName: 'statistic',
        msInterval: 1000 * 60,
      },
    };
Garbage collector
The garbage collector is a periodic task that finds "old" items which have been spooled but never fetched. This can happen when a crawler instance starts processing an item and then the job fails or the process is terminated. The garbage collector ensures that such items do not go missing from the queue.
Garbage collector config object properties:
Property name | Type | Comment |
---|---|---|
run | boolean | disable/enable GC |
msInterval | number | Interval between GC tasks in milliseconds. |
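The idea behind the garbage collector can be illustrated with a small in-memory sketch. This is not the package's implementation: the `spooledAt` field, the `collectStaleItems` helper, and the choice to reset stale items to `queued` are assumptions made for illustration only.

```javascript
// Hypothetical sketch: `spooledAt` and `collectStaleItems` are illustrative
// names, not part of mongodb-simplecrawler-queue.
function collectStaleItems(items, now, maxAgeMs) {
  const requeued = [];
  for (const item of items) {
    const isStale =
      item.status === 'spooled' &&
      !item.fetched &&
      now - item.spooledAt > maxAgeMs;
    if (isStale) {
      // Assumption: a stale item is made available to crawlers again.
      item.status = 'queued';
      requeued.push(item.url);
    }
  }
  return requeued;
}

const now = 1554661504815;
const tenMinutes = 10 * 60 * 1000;
const items = [
  { url: 'http://example.com/a', status: 'spooled', fetched: false, spooledAt: now - tenMinutes },
  { url: 'http://example.com/b', status: 'spooled', fetched: false, spooledAt: now - 1000 },
  { url: 'http://example.com/c', status: 'downloaded', fetched: true, spooledAt: now - tenMinutes },
];

// Treat anything spooled more than five minutes ago as stale.
const requeued = collectStaleItems(items, now, 5 * 60 * 1000);
console.log(requeued);
```

Only the first item is both spooled and older than the threshold, so it is the only one picked up. In a real deployment the same selection would be expressed as a query against the queue collection.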
Monitoring
The monitoring task aggregates statistics about the current queue state and stores them in another collection of the same MongoDB instance.
Monitoring config object properties:
Property name | Type | Comment |
---|---|---|
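run | boolean | disable/enable monitoring |
statisticCollectionName | string | [default = 'statistic'] Name of the collection that receives the statistic data. Warning: do not use the same collection name for the queue and for the statistic data! Doing so breaks the crawler logic and the crawler job will never finish. |
msInterval | number | Interval between monitoring tasks in milliseconds. |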
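Because a name clash between the two collections prevents the job from ever finishing, it may be worth checking the configuration before starting a crawl. The helper below is a hypothetical guard written for this README, not something the package provides; it only relies on the configuration keys shown above.

```javascript
// Hypothetical helper: `assertDistinctCollections` is not part of the package.
function assertDistinctCollections(config) {
  if (!config.monitorConfig) {
    return config; // monitoring is not configured, nothing to compare
  }
  // Fall back to the documented default name when none is given.
  const statisticName = config.monitorConfig.statisticCollectionName || 'statistic';
  if (statisticName === config.collectionName) {
    throw new Error(
      `Queue and statistic collections must differ, both are "${statisticName}"`
    );
  }
  return config;
}

const goodConfig = {
  url: 'mongodb://localhost:27017',
  dbName: 'crawler',
  collectionName: 'queue',
  monitorConfig: { run: true, statisticCollectionName: 'statistic', msInterval: 1000 * 60 },
};
assertDistinctCollections(goodConfig); // passes, names differ

try {
  assertDistinctCollections({
    ...goodConfig,
    monitorConfig: { run: true, statisticCollectionName: 'queue' },
  });
} catch (err) {
  console.error(err.message);
}
```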
Current aggregation data:
Queue monitoring adds a statistic item to the statistic collection every msInterval milliseconds, so you can query this collection at runtime to see the progress of the job.
Example of the statistic item:
    {
      // Aggregated statistics from the queue items - see https://github.com/simplecrawler/simplecrawler#queue-statistics-and-reporting
      actualDataSizeAvg: 4064118656716418,
      actualDataSizeMax: 203492,
      actualDataSizeMin: 179,
      contentLengthAvg: 4064118656716418,
      contentLengthMax: 203492,
      contentLengthMin: 179,
      downloadTimeAvg: 3629634328358209,
      downloadTimeMax: 19584,
      downloadTimeMin: 2,
      requestLatencyAvg: 3493283582089552,
      requestLatencyMax: 265,
      requestLatencyMin: 22,
      requestTimeAvg: 36645671641791046,
      requestTimeMax: 19608,
      requestTimeMin: 25,

      // Count of items grouped by QueueItem.status (see https://github.com/simplecrawler/simplecrawler#queue-items):
      queued: 37449,
      downloaded: 696,
      headers: 7,
      spooled: 2,
      downloadprevented: 1,
      timeout: 1,
      created: 1,
      failed: 1,
      notfound: 1,
      redirected: 1,
      pulled: 0,

      // General info - total count of items
      totalCount: 22952,
      // General info - count of items with property "fetched": true, meaning the item is fully processed
      fetchedCount: 134,
      // Timestamp of the request
      timestamp: 1554661504815,
      // MongoDB id
      _id: "5caa418467fa3c00083b4b7a"
    }
Note: if no items have a given QueueItem.status, the property for that status is not included in the statistic item.
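Since status counts can be absent from a statistic item, code that reads the statistic collection should default missing counts to zero. The following is a minimal sketch of such a reader; `summarizeStatistic` and the progress ratio (fetchedCount divided by totalCount) are assumptions for illustration, not part of the package.

```javascript
// Status names as they appear in the example statistic item.
const STATUSES = [
  'queued', 'downloaded', 'headers', 'spooled', 'downloadprevented',
  'timeout', 'created', 'failed', 'notfound', 'redirected', 'pulled',
];

// Hypothetical helper: normalizes one statistic item for reporting.
function summarizeStatistic(stat) {
  const counts = {};
  for (const status of STATUSES) {
    counts[status] = stat[status] || 0; // absent status means zero items
  }
  const total = stat.totalCount || 0;
  const fetched = stat.fetchedCount || 0;
  // Assumed progress measure: share of items that are fully processed.
  const progress = total > 0 ? fetched / total : 0;
  return { counts, progress };
}

const summary = summarizeStatistic({
  queued: 30,
  downloaded: 10,
  totalCount: 40,
  fetchedCount: 10,
});
console.log(summary.progress, summary.counts.failed);
```

Here `failed` is missing from the input item, so it is reported as 0 rather than undefined.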
WARNING: The GC and monitoring tasks can be slowed down when they run in parallel with the crawler itself. For better performance, run GC and monitoring as separate utilities in a separate process.
Additional utilities
You can run the GC and monitoring tasks as blocking operations using the Utils class and its runGC and runMonitoring methods.
You can also drop the queue entirely using the dropQueue method. Note: this method also clears the statistic collection.
// TODO examples
Performance
// TODO tests and results, some graphics
Running MongoDB locally in Docker
Start MongoDB in a Docker container:
docker run --name <CONTAINER_NAME> -p 27017:27017 -d mongo:3.6.17-xenial
Open a shell in the running container:
docker exec -it <CONTAINER_NAME> bash