Thursday, July 22, 2010

Hadoop cluster setup

Hadoop setup
Important Directories
One of the basic tasks involved in setting up a Hadoop cluster is determining where the various Hadoop-related directories will be located. Where they go is up to you; in some cases, the default locations are inadvisable and should be changed. This section identifies these directories.
Directory | Description | Default location | Suggested location
HADOOP_LOG_DIR | Output location for log files from daemons | ${HADOOP_HOME}/logs | /var/log/hadoop
hadoop.tmp.dir | A base for other temporary directories | /tmp/hadoop-${user.name} | /tmp/hadoop
dfs.name.dir | Where the NameNode metadata should be stored | ${hadoop.tmp.dir}/dfs/name | /home/hadoop/dfs/name
dfs.data.dir | Where DataNodes store their blocks | ${hadoop.tmp.dir}/dfs/data | /home/hadoop/dfs/data
mapred.system.dir | The in-HDFS path to shared MapReduce system files | ${hadoop.tmp.dir}/mapred/system | /hadoop/mapred/system
This table is not exhaustive; several other directories are listed in conf/hadoop-default.xml. The remaining directories, however, are initialized by default to reside under hadoop.tmp.dir, and are unlikely to be a concern.
It is critically important in a real cluster that dfs.name.dir and dfs.data.dir be moved out from hadoop.tmp.dir. A real cluster should never consider these directories temporary, as they are where all persistent HDFS data resides. Production clusters should have two paths listed for dfs.name.dir which are on two different physical file systems, to ensure that cluster metadata is preserved in the event of hardware failure.
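
As a rough sketch of the advice above, the following shell snippet prints property overrides that could go into conf/hadoop-site.xml (assumed here to be the site override file for this Hadoop version). The second dfs.name.dir path, /mnt/disk2/hadoop/dfs/name, is a hypothetical mount on a separate physical disk, and the function names are made up for illustration. dfs.name.dir accepts a comma-separated list, and the NameNode metadata is written to every directory in it.

```shell
#!/bin/sh
# Sketch: print site overrides that move HDFS storage out of hadoop.tmp.dir.
NAME_DIR_1="/home/hadoop/dfs/name"        # suggested location from the table
NAME_DIR_2="/mnt/disk2/hadoop/dfs/name"   # ASSUMPTION: hypothetical second physical disk
DATA_DIR="/home/hadoop/dfs/data"          # suggested location from the table

# Print one <property> element for the given name ($1) and value ($2).
emit_property() {
  printf '  <property>\n    <name>%s</name>\n    <value>%s</value>\n  </property>\n' "$1" "$2"
}

# Print a complete configuration document on standard output.
emit_site_config() {
  printf '<?xml version="1.0"?>\n<configuration>\n'
  emit_property dfs.name.dir "${NAME_DIR_1},${NAME_DIR_2}"
  emit_property dfs.data.dir "${DATA_DIR}"
  printf '</configuration>\n'
}

emit_site_config
```

Redirect the output to a scratch file and review it before merging anything into the real configuration.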

Tuesday, July 20, 2010

Hadoop installation instructions from IBM

Hadoop installation instructions part 1
Hadoop installation instructions part 2

Monday, June 28, 2010

Upgrading from Mac Leopard to Snow Leopard-clean install-external HDD

1. Buy a WD Scorpio 320GB and put it in an external enclosure
2. Format the drive and use a GUID partition table
   - Make a new partition on the external drive to hold the OS/applications: Mac \hdd
   - Make additional partitions to hold videos...
3. Insert the Snow Leopard DVD and reboot
4. Install Snow Leopard on the external hdd Mac \hdd partition
5. Reboot
6. Update the Mac software Menu->Apple->Software Update
7. Now copy your old login.keychain to the Mac \hdd/Volumes/Users//Library/Keychains
8. Use Keychain Access to create a new keychain file, then quit Keychain Access. In a shell, copy the old keychain file over the newly created one.
9. Enable root access: http://support.apple.com/kb/ht1528
10. Follow the instructions on this page except ignore the Keychain restoration procedure. Apple personal information transfer instructions

Friday, June 4, 2010

Selecting the right HDD for large data applications

Selecting the right HDD is about more than just getting a good deal at Fry's. Not all HDDs are created equal.
Caviar black discussion including motor load and spindles


Drive specs including platter sizes

Friday, April 23, 2010

finding the match boundaries in a Perl regex

Perl FAQ
"Since Perl 5.6.1 the special variables @- and @+ can functionally replace $`, $& and $'. These arrays contain pointers to the beginning and end of each match (see perlvar for the full story), so they give you essentially the same information, but without the risk of excessive string copying."


Regex-Related Special Variables

Perl has a host of special variables that get filled after every m// or s/// regex match. $1, $2, $3, etc. hold the backreferences. $+ holds the last (highest-numbered) backreference. $& (dollar ampersand) holds the entire regex match.

@- is an array of match-start indices into the string. $-[0] holds the start of the entire regex match, $-[1] the start of the first backreference, etc. Likewise, @+ holds match-end indices (ends, not lengths).

$' (dollar followed by an apostrophe or single quote) holds the part of the string after (to the right of) the regex match. $` (dollar backtick) holds the part of the string before (to the left of) the regex match. Using these variables is not recommended in scripts when performance matters, as it causes Perl to slow down all regex matches in your entire script.

All these variables are read-only, and persist until the next regex match is attempted. They are dynamically scoped, as if they had an implicit 'local' at the start of the enclosing scope. Thus if you do a regex match, and call a sub that does a regex match, when that sub returns, your variables are still set as they were for the first match.


if ($lineCopy =~ /$joinedColumns/g) {
    # $-[0] is the start offset of the whole match; $+[0] is the offset just past its end
    my $start = $-[0];

    print "MATCH: Found '$&'. lineCopy= " . $lineCopy . "\n";

    print "MATCH: atminux = @- atplus= @+\n";
    # print "MATCH: Next attempt at character " . (pos($lineCopy) + 1) . "\n";
}
else {
    print "NO MATCH: line = $lineCopy joinedColumns = $joinedColumns\n";
}

MATCH: Found 'attachments,grinder attachments'. lineCopy= tools,attachments,grinder attachments
MATCH: atminux = 6 atplus= 37
NO MATCH: line = tools,attachments,hammer \& hammer drill attachments joinedColumns = attachments,hammer\ \&\ hammer\ drill\ attachments
MATCH: Found 'attachments,jig saw attachments'. lineCopy= tools,attachments,jig saw attachments
MATCH: atminux = 6 atplus= 37
MATCH: Found 'attachments,metal case'. lineCopy= tools,attachments,metal case
MATCH: atminux = 6 atplus= 28
MATCH: Found 'attachments,miter saw attachments'. lineCopy= tools,attachments,miter saw attachments
MATCH: atminux = 6 atplus= 39
MATCH: Found 'attachments,nibbler attachments'. lineCopy= tools,attachments,nibbler attachments
MATCH: atminux = 6 atplus= 37

Friday, April 16, 2010

TRAC installation including trac HTML form based authentication

trac-admin /home/trac/yo_web_services initenv
chown -R apache:apache /home/svn/yo_web_services
chown -R apache:apache /home/trac/yo_web_services

vim /etc/httpd/conf.d/trac.conf
>>

<Location /trac/yo_web_services>
    SetHandler mod_python
    PythonHandler trac.web.modpython_frontend
    PythonOption TracEnv /home/trac/yo_web_services
    PythonOption TracUriRoot /trac/yo_web_services
</Location>

<Location /trac/yo_web_services/login>
    AuthType Basic
    AuthName "trac"
    AuthUserFile /home/trac/trac.htpasswd
    # comment the next line if using HTML form based login using the trac plugins
    # per the trac-hacks page
    # Require valid-user
</Location>

<<
touch /home/trac/yo_web_services.htpasswd
# Add users to password file
htpasswd -m /home/trac/yo_web_services.htpasswd <username>
trac-admin /home/trac/yo_web_services permission add <username> TRAC_ADMIN

service httpd restart

Add the plugins from this page
http://trac-hacks.org/wiki/AccountManagerPlugin

Thursday, April 15, 2010

Mac OSX X11 fix

make sure /usr/X11R6 is empty
cd /usr
ln -s X11R6 X11
now the dylibs will be found...

Monday, March 8, 2010

Fedora 12 Cloudera Hadoop setup + Java JDK

Cloudera's Hadoop distribution

When installing Cloudera's Hadoop distribution on Fedora 12, make sure you install
the Sun JDK using the method recommended below.




Sun Java


Fedora Java installation
# yum install hadoop
Loaded plugins: presto, refresh-packagekit
Setting up Install Process
Resolving Dependencies
--> Running transaction check
---> Package hadoop.noarch 0:0.18.3-14.cloudera.CH0_3 set to be updated
--> Processing Dependency: jdk >= 1.6 for package: hadoop-0.18.3-14.cloudera.CH0_3.noarch
--> Finished Dependency Resolution
hadoop-0.18.3-14.cloudera.CH0_3.noarch from cloudera-stable has depsolving problems
--> Missing Dependency: jdk >= 1.6 is needed by package hadoop-0.18.3-14.cloudera.CH0_3.noarch (cloudera-stable)
Error: Missing Dependency: jdk >= 1.6 is needed by package hadoop-0.18.3-14.cloudera.CH0_3.noarch (cloudera-stable)
You could try using --skip-broken to work around the problem
You could try running: package-cleanup --problems
package-cleanup --dupes
rpm -Va --nofiles --nodigest

Cloudera RPM Java installation to avoid the yum install dep problem

Thursday, January 21, 2010

how to speed up Heritrix

I figured out why the Heritrix crawler was running at one page per second.
It was configured to run with a default Java heap size of 256m.

cat /etc/init.d/heritrix.sh
#!/bin/bash

/opt/heritrix/bin/heritrix --bind=yowb3 --admin=admin:admin

I changed this to 2048m by exporting JAVA_OPTS before launching, and it seems to be running 10x faster.

cat /etc/init.d/heritrix.sh
#!/bin/bash
export JAVA_OPTS=" -Xmx2048m"

/opt/heritrix/bin/heritrix --bind=yowb3 --admin=admin:admin

-----------------

Rates
9.55 URIs/sec (16.1 avg)
246 KB/sec (389 avg)

Load
6 active of 50 threads
1 congestion ratio

Thursday, January 7, 2010

Lucene index writes per minute slow down

28 million data records were indexed.
The write rate for the index was as follows:
Time | Writes per minute
0-2 minutes | 100,000
... | ...
8 hours later | 5,000

See the graph, which shows writes per second.
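
To put the slowdown in perspective, here is a back-of-the-envelope calculation in shell that uses only the figures quoted above. The helper name minutes_at_rate is made up for this example, and the estimate assumes each rate held constant for the whole run, which the real run clearly did not.

```shell
#!/bin/sh
# Rough estimate from the figures in the post; not a measurement.
RECORDS=28000000      # total records indexed
INITIAL_RATE=100000   # writes per minute during the first two minutes
LATE_RATE=5000        # writes per minute eight hours in

# Print the whole minutes needed to write $1 records at $2 writes per minute.
minutes_at_rate() {
  echo $(( $1 / $2 ))
}

echo "Slowdown factor:          $(( $INITIAL_RATE / $LATE_RATE ))x"
echo "Minutes at initial rate:  $(minutes_at_rate "$RECORDS" "$INITIAL_RATE")"
echo "Minutes at degraded rate: $(minutes_at_rate "$RECORDS" "$LATE_RATE")"
```

At the initial rate the whole data set would take 280 minutes; at the degraded rate it would take 5,600 minutes, a 20x difference.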