I figured out why the Heritrix crawler was running at one page per second.
It was configured to run with the default Java VM heap size of 256m.
cat /etc/init.d/heritrix.sh
#!/bin/bash
/opt/heritrix/bin/heritrix --bind=yowb3 --admin=admin:admin
I changed this to 2048m, and it now seems to be running about 10x faster.
cat /etc/init.d/heritrix.sh
#!/bin/bash
export JAVA_OPTS=" -Xmx2048m"
/opt/heritrix/bin/heritrix --bind=yowb3 --admin=admin:admin
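To double-check which heap limit a given set of JVM options actually carries, a small helper can pull the -Xmx value out of the string. This is only a rough sketch: xmx_of is a made-up helper name (not part of Heritrix), and it assumes the options are separated by plain spaces, as in the JAVA_OPTS line above.

```shell
#!/bin/bash
# xmx_of: hypothetical helper, not part of Heritrix.
# Prints the value of the last -Xmx option found in a space-separated
# string of JVM options; prints nothing if there is no -Xmx option.
xmx_of() {
  printf '%s\n' "$1" | tr ' ' '\n' | sed -n 's/^-Xmx//p' | tail -n 1
}

# Check the value exported in the init script above.
xmx_of " -Xmx2048m"
```

The same function could be fed the command line of the running java process (for example, as reported by ps) to confirm that the crawler was really restarted with the new setting.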
-----------------
Rates
9.55 URIs/sec (16.1 avg)
246 KB/sec (389 avg)
Load
6 active of 50 threads
1 congestion ratio