Thursday, January 21, 2010

How to speed up Heritrix

I figured out why the Heritrix crawler was running at only one page per second.
It was configured to run with the default Java VM maximum heap size of 256m.

cat /etc/init.d/heritrix.sh
#!/bin/bash

/opt/heritrix/bin/heritrix --bind=yowb3 --admin=admin:admin

I changed this to 2048m, and it seems to be running 10x faster:

cat /etc/init.d/heritrix.sh
#!/bin/bash
export JAVA_OPTS=" -Xmx2048m"

/opt/heritrix/bin/heritrix --bind=yowb3 --admin=admin:admin
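
A slightly more defensive variant of the script above might only add the heap flag when JAVA_OPTS does not already contain one, so an existing setting is not overridden. This is just a sketch: the function name set_heap and its default value are made up for illustration and are not part of Heritrix.

```shell
#!/bin/bash
# Sketch only: set_heap is a hypothetical helper, not something Heritrix ships.
# It appends -Xmx<size> to JAVA_OPTS unless a -Xmx flag is already present.
set_heap() {
  local heap="${1:-2048m}"          # default mirrors the value used in this post
  case " ${JAVA_OPTS:-} " in
    *" -Xmx"*)
      ;;                            # a heap size was already chosen; leave it alone
    *)
      JAVA_OPTS="${JAVA_OPTS:+$JAVA_OPTS }-Xmx${heap}"
      ;;
  esac
  export JAVA_OPTS
}
```

Calling set_heap 2048m just before the /opt/heritrix/bin/heritrix line should have the same effect as the export above when JAVA_OPTS starts out empty.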

-----------------

Rates
9.55 URIs/sec (16.1 avg)
246 KB/sec (389 avg)

Load
6 active of 50 threads
1 congestion ratio

Thursday, January 7, 2010

Lucene index writes per minute slow down

28 million data records were indexed.
The write rate for the index was as follows:
Time             Writes per minute
0-2 minutes      100,000
...
8 hours later    5,000

See the graph, which shows writes per second.
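
To put the slowdown in perspective, here is a quick back-of-the-envelope calculation using only the figures from the table above. It assumes, purely for illustration, that a rate stays constant for the whole run, which is not what actually happened.

```shell
#!/bin/bash
# Figures taken from the table in this post.
records=28000000       # total data records indexed
initial_rate=100000    # writes per minute during the first two minutes
late_rate=5000         # writes per minute eight hours later

# Integer arithmetic is enough here; all divisions are exact.
slowdown=$(( initial_rate / late_rate ))
minutes_at_initial=$(( records / initial_rate ))
minutes_at_late=$(( records / late_rate ))

echo "slowdown factor: ${slowdown}x"
echo "all records at the initial rate: ${minutes_at_initial} minutes"
echo "all records at the late rate: ${minutes_at_late} minutes"
```

That is a 20x drop in write rate: 280 minutes (under 5 hours) at the initial rate versus 5,600 minutes (over 93 hours) at the late rate.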