Hadoop setup
Important Directories
One of the basic tasks involved in setting up a Hadoop cluster is determining where the various Hadoop-related directories will be located. Where they go is up to you; in some cases, the default locations are inadvisable and should be changed. This section identifies these directories.
| Directory | Description | Default location | Suggested location |
|---|---|---|---|
| HADOOP_LOG_DIR | Output location for log files from daemons | ${HADOOP_HOME}/logs | /var/log/hadoop |
| hadoop.tmp.dir | A base for other temporary directories | /tmp/hadoop-${user.name} | /tmp/hadoop |
| dfs.name.dir | Where the NameNode metadata should be stored | ${hadoop.tmp.dir}/dfs/name | /home/hadoop/dfs/name |
| dfs.data.dir | Where DataNodes store their blocks | ${hadoop.tmp.dir}/dfs/data | /home/hadoop/dfs/data |
| mapred.system.dir | The in-HDFS path to shared MapReduce system files | ${hadoop.tmp.dir}/mapred/system | /hadoop/mapred/system |
This table is not exhaustive; several other directories are listed in conf/hadoop-default.xml. The remaining directories, however, are initialized by default to reside under hadoop.tmp.dir, and are unlikely to be a concern.
It is critically important in a real cluster that dfs.name.dir and dfs.data.dir be moved out from hadoop.tmp.dir. A real cluster should never consider these directories temporary, as they are where all persistent HDFS data resides. Production clusters should have two paths listed for dfs.name.dir which are on two different physical file systems, to ensure that cluster metadata is preserved in the event of hardware failure.
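As a minimal sketch of how these overrides might look, the shell snippet below prints a hadoop-site.xml that sets the properties from the table to their suggested locations. The helper names emit_property and emit_site_config are made up for this example, and /mnt/disk2/dfs/name is a hypothetical second NameNode path standing in for a directory on a different physical file system. HADOOP_LOG_DIR is an environment variable rather than a property, so it would be set in conf/hadoop-env.sh instead.

```shell
# Print one <property> element in hadoop-site.xml format.
emit_property() {
    printf '  <property>\n    <name>%s</name>\n    <value>%s</value>\n  </property>\n' "$1" "$2"
}

# Print a complete hadoop-site.xml that overrides the directories from the table.
# ASSUMPTION: /mnt/disk2/dfs/name is a hypothetical path on a second physical
# file system; dfs.name.dir takes a comma-separated list of directories.
emit_site_config() {
    echo '<?xml version="1.0"?>'
    echo '<configuration>'
    emit_property hadoop.tmp.dir /tmp/hadoop
    emit_property dfs.name.dir /home/hadoop/dfs/name,/mnt/disk2/dfs/name
    emit_property dfs.data.dir /home/hadoop/dfs/data
    emit_property mapred.system.dir /hadoop/mapred/system
    echo '</configuration>'
}

emit_site_config
```

Redirecting the output to conf/hadoop-site.xml would apply the overrides. The local paths still have to exist and be writable by the user running the daemons, so review them before using anything like this on a real cluster.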
Thursday, July 22, 2010
Tuesday, July 20, 2010
Monday, June 28, 2010
Upgrading from Mac Leopard to Snow Leopard: clean install on an external HDD
1. Buy a WD Scorpio 320GB and put it in an external enclosure
2. Format the drive and use a GUID partition table
   - Make a new partition on the external drive to hold the OS/applications (Mac \hdd)
   - Make additional partitions to hold videos...
3. Insert the Snow Leopard DVD and reboot
4. Install Snow Leopard on the external hdd Mac \hdd partition
5. Reboot
6. Update the Mac software Menu->Apple->Software Update
7. Now copy your old login.keychain to the Mac \hdd/Volumes/Users//Library/Keychains
8. Use Keychain Access to create a new keychain file. Then quit Keychain Access. In a shell, copy the old keychain file over the newly created one.
9. Enable root access: http://support.apple.com/kb/ht1528
10. Follow the instructions on this page except ignore the Keychain restoration procedure. Apple personal information transfer instructions
Friday, June 4, 2010
Selecting the right HDD for large data applications
Selecting the right HDD is about more than just getting a good deal at Fry's. Not all HDDs are created equal.
Caviar black discussion including motor load and spindles
Drive specs including platter sizes
Friday, April 23, 2010
finding the match boundaries in a Perl regex
Perl FAQ
"Since Perl 5.6.1 the special variables @- and @+ can functionally replace $`, $& and $'. These arrays contain pointers to the beginning and end of each match (see perlvar for the full story), so they give you essentially the same information, but without the risk of excessive string copying."
Regex-Related Special Variables
Perl has a host of special variables that get filled after every m// or s/// regex match. $1, $2, $3, etc. hold the backreferences. $+ holds the last (highest-numbered) backreference. $& (dollar ampersand) holds the entire regex match.
@- is an array of match-start indices into the string. $-[0] holds the start of the entire regex match, $-[1] the start of the first backreference, etc. Likewise, @+ holds match-end indices (ends, not lengths).
$' (dollar followed by an apostrophe or single quote) holds the part of the string after (to the right of) the regex match. $` (dollar backtick) holds the part of the string before (to the left of) the regex match. Using these variables is not recommended in scripts when performance matters, as it causes Perl to slow down all regex matches in your entire script.
All these variables are read-only, and persist until the next regex match is attempted. They are dynamically scoped, as if they had an implicit 'local' at the start of the enclosing scope. Thus if you do a regex match, and call a sub that does a regex match, when that sub returns, your variables are still set as they were for the first match.
if ($lineCopy =~ /$joinedColumns/g) {
    my $start = $-[0];  # @- holds match starts; $-[0] is where the whole match begins
    print "MATCH: Found '$&'. lineCopy= " . $lineCopy . "\n";
    # @- and @+ interpolate as the lists of start and end offsets
    print "MATCH: atminux = @- atplus= @+\n";
    # print "MATCH: Next attempt at character " . (pos($lineCopy) + 1) . "\n";
}
else {
    print "NO MATCH: line = $lineCopy joinedColumns = $joinedColumns\n";
}
MATCH: Found 'attachments,grinder attachments'. lineCopy= tools,attachments,grinder attachments
MATCH: atminux = 6 atplus= 37
NO MATCH: line = tools,attachments,hammer \& hammer drill attachments joinedColumns = attachments,hammer\ \&\ hammer\ drill\ attachments
MATCH: Found 'attachments,jig saw attachments'. lineCopy= tools,attachments,jig saw attachments
MATCH: atminux = 6 atplus= 37
MATCH: Found 'attachments,metal case'. lineCopy= tools,attachments,metal case
MATCH: atminux = 6 atplus= 28
MATCH: Found 'attachments,miter saw attachments'. lineCopy= tools,attachments,miter saw attachments
MATCH: atminux = 6 atplus= 39
MATCH: Found 'attachments,nibbler attachments'. lineCopy= tools,attachments,nibbler attachments
MATCH: atminux = 6 atplus= 37
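The offsets in the output above can be cross-checked without Perl. The sketch below is plain shell, not the @- / @+ mechanism itself: for a literal substring it computes the same two numbers, the start offset and the end offset (start plus match length, so an end, not a length). The function name match_bounds is made up for this example.

```shell
# Print "START END" for the first literal occurrence of $2 inside $1.
# START is the 0-based offset where the match begins; END is the offset
# just past the match, which is what Perl stores in $-[0] and $+[0].
match_bounds() {
    line=$1
    needle=$2
    case $line in
        *"$needle"*) ;;      # substring present, carry on
        *) return 1 ;;       # no match
    esac
    prefix=${line%%"$needle"*}   # everything before the first occurrence
    start=${#prefix}
    end=$((start + ${#needle}))
    printf '%s %s\n' "$start" "$end"
}

match_bounds 'tools,attachments,grinder attachments' 'attachments,grinder attachments'
match_bounds 'tools,attachments,metal case' 'attachments,metal case'
```

The two calls print 6 37 and 6 28, matching the atminux and atplus values logged for those lines.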
Friday, April 16, 2010
TRAC installation including trac HTML form based authentication
trac-admin /home/trac/yo_web_services initenv
chown -R apache.apache /home/svn/yo_web_services
chown -R apache.apache /home/trac/yo_web_services
vim /etc/httpd/conf.d/trac.conf
<Location /trac/yo_web_services>
    SetHandler mod_python
    PythonHandler trac.web.modpython_frontend
    PythonOption TracEnv /home/trac/yo_web_services
    PythonOption TracUriRoot /trac/yo_web_services
    AuthType Basic
    AuthName "trac"
    AuthUserFile /home/trac/yo_web_services.htpasswd
    # leave the next line commented out if using HTML form based login via the trac plugins
    # per the trac-hacks page
    # Require valid-user
</Location>
touch /home/trac/yo_web_services.htpasswd
# Add users to the password file
htpasswd -m /home/trac/yo_web_services.htpasswd
trac-admin /home/trac/yo_web_services permission add TRAC_ADMIN
service httpd restart
Add the plugins from this page
http://trac-hacks.org/wiki/AccountManagerPlugin
Thursday, April 15, 2010
Thursday, March 18, 2010
Wednesday, March 17, 2010
Monday, March 8, 2010
Fedora 12 Cloudera Hadoop setup + Java JDK
Cloudera's Hadoop distribution
When installing Cloudera's Hadoop distribution on Fedora 12, make sure you install the Sun JDK first, using the method recommended below.
Sun Java
Fedora Java installation
# yum install hadoop
Loaded plugins: presto, refresh-packagekit
Setting up Install Process
Resolving Dependencies
--> Running transaction check
---> Package hadoop.noarch 0:0.18.3-14.cloudera.CH0_3 set to be updated
--> Processing Dependency: jdk >= 1.6 for package: hadoop-0.18.3-14.cloudera.CH0_3.noarch
--> Finished Dependency Resolution
hadoop-0.18.3-14.cloudera.CH0_3.noarch from cloudera-stable has depsolving problems
--> Missing Dependency: jdk >= 1.6 is needed by package hadoop-0.18.3-14.cloudera.CH0_3.noarch (cloudera-stable)
Error: Missing Dependency: jdk >= 1.6 is needed by package hadoop-0.18.3-14.cloudera.CH0_3.noarch (cloudera-stable)
You could try using --skip-broken to work around the problem
You could try running: package-cleanup --problems
package-cleanup --dupes
rpm -Va --nofiles --nodigest
Cloudera RPM Java installation to avoid the yum install dep problem
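Before running yum install hadoop, it can help to confirm that the JDK on the machine is at least 1.6. A minimal sketch follows; the function name jdk_version_ok is made up for this example, and it only compares a version string such as the one reported by java -version. Keep in mind that yum resolves the dependency against an installed RPM that provides jdk, so a JDK unpacked from a tarball could pass this check and still leave yum unsatisfied.

```shell
# Return success if the given Java version string is 1.6 or newer.
# Accepts strings such as 1.5.0_22, 1.6.0_18 or 11.0.2.
jdk_version_ok() {
    version=$1
    major=${version%%.*}          # text before the first dot
    rest=${version#*.}            # text after the first dot
    minor=${rest%%[!0-9]*}        # leading digits of the remainder
    case $major in
        ''|*[!0-9]*) return 2 ;;  # not a version number at all
    esac
    [ -n "$minor" ] || minor=0
    if [ "$major" -gt 1 ]; then
        return 0                  # newer numbering scheme, e.g. 11.0.2
    fi
    [ "$major" -eq 1 ] && [ "$minor" -ge 6 ]
}

for v in 1.5.0_22 1.6.0_18; do
    if jdk_version_ok "$v"; then
        echo "$v: satisfies jdk >= 1.6"
    else
        echo "$v: too old for this hadoop package"
    fi
done
```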
Thursday, January 21, 2010
how to speed up Heritrix
I figured out why the Heritrix crawler was running at one page per second.
It was configured to run with a default Java heap size of 256m.
cat /etc/init.d/heritrix.sh
#!/bin/bash
/opt/heritrix/bin/heritrix --bind=yowb3 --admin=admin:admin
I changed this to 2048m by exporting JAVA_OPTS, and it seems to be running 10x faster.
cat /etc/init.d/heritrix.sh
#!/bin/bash
export JAVA_OPTS=" -Xmx2048m"
/opt/heritrix/bin/heritrix --bind=yowb3 --admin=admin:admin
-----------------
Rates
9.55 URIs/sec (16.1 avg)
246 KB/sec (389 avg)
Load
6 active of 50 threads
1 congestion ratio
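A slightly more flexible version of the startup script might keep any JAVA_OPTS already set in the environment and only add the heap flag when none is present. This is a sketch under assumptions: build_java_opts and CRAWLER_HEAP are names invented for this example, not Heritrix settings, and the launch line is commented out so the snippet can be run without Heritrix installed.

```shell
# Build a JAVA_OPTS value: keep existing options, add -Xmx only if missing.
#   $1 = heap size to use (defaults to 2048m)
#   $2 = options that are already set, if any
build_java_opts() {
    heap=${1:-2048m}
    existing=${2:-}
    case $existing in
        *-Xmx*) printf '%s\n' "$existing" ;;                 # heap already chosen
        *) printf '%s\n' "${existing:+$existing }-Xmx$heap" ;;
    esac
}

# CRAWLER_HEAP is a hypothetical knob for this sketch.
JAVA_OPTS=$(build_java_opts "${CRAWLER_HEAP:-2048m}" "${JAVA_OPTS:-}")
export JAVA_OPTS
echo "JAVA_OPTS=$JAVA_OPTS"

# /opt/heritrix/bin/heritrix --bind=yowb3 --admin=admin:admin
```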
Thursday, January 7, 2010
Lucene index writes per minute slow down
Sunday, January 3, 2010
Drupal/LAMP installation on Ubuntu
Install XAMPP (LAMP) and DRUPAL on Ubuntu
Old notes below:
1. Install LAMP
XAMPP install made easy: use the instructions on this site to install the LAMP stack
2. Install Drupal
Reset the MySQL password if necessary.
http://en.kioskea.net/faq/sujet-630-reinitializing-the-root-password-of-mysql
Install DRUPAL on Ubuntu
Alternate installation instructions with notes on security and important files