Notes intended for internal reference on server issues.
Adding a listserv
As root user:
Copied the existing listserv files under /usr2/home/majordomo (the 'nodes' listserv, for example) to files for the new listserv ('hypoxia', for example). Changed permissions to match the other listservs. Removed all email addresses except my own from /usr2/home/majordomo/lists/hypoxia and /usr2/home/majordomo/lists/hypoxia_post.
[root]# ls -l /usr2/home/majordomo/lists/hypoxia*
-rw-r--r-- 1 majordomo daemon 20 Jul 8 16:40 /usr2/home/majordomo/lists/hypoxia
-rw-r----- 1 majordomo daemon 16498 Jul 8 16:35 /usr2/home/majordomo/lists/hypoxia.config
-rw-r--r-- 1 majordomo daemon 20 Jul 8 16:41 /usr2/home/majordomo/lists/hypoxia_post
-rw-r----- 1 majordomo daemon 16541 Jul 8 16:35 /usr2/home/majordomo/lists/hypoxia_post.config
/usr2/home/majordomo/lists/hypoxia.archive:
total 4
-rw-rw-r-- 1 majordomo daemon 1660 Jul 8 17:07 hypoxia.0507
Used a perl search and replace to change references from 'nodes' to 'hypoxia':
[root]# perl -pi -e 's/nodes/hypoxia/g' hypox*
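The copy and rename steps above can be sketched as one shell function. This is a minimal sketch, not the procedure actually used: the function name clone_list is hypothetical, and it assumes the files shown in the listing (LIST, LIST.config, LIST_post, LIST_post.config) plus the LIST.archive directory are all a new list needs. It substitutes with sed rather than perl -pi so that copying and renaming happen in one pass.

```shell
# Hypothetical helper: clone_list SRC NEW [LISTS_DIR]
# Copies the SRC majordomo list files to NEW, rewriting SRC -> NEW inside them.
# Assumes SRC contains no sed special characters (true for names like 'nodes').
clone_list() {
  local src="$1" new="$2"
  local dir="${3:-/usr2/home/majordomo/lists}"
  local suffix
  for suffix in "" .config _post _post.config; do
    [ -f "$dir/$src$suffix" ] || continue
    # cp -p keeps the mode (and owner/group when run as root) ...
    cp -p "$dir/$src$suffix" "$dir/$new$suffix"
    # ... then the content is overwritten with the substituted text; the mode stays
    sed "s/$src/$new/g" "$dir/$src$suffix" > "$dir/$new$suffix"
  done
  # the LIST-archive alias has archive2.pl write into LIST.archive
  mkdir -p "$dir/$new.archive"
}
```

For example, as root: clone_list nodes hypoxia, then chown -R majordomo:daemon /usr2/home/majordomo/lists/hypoxia* to match the listing above, and trim the member files (hypoxia, hypoxia_post) down to your own address before testing.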
Copied the 'nodes' section of the /etc/aliases file and adapted it for 'hypoxia':
hypoxia: "|/usr2/home/majordomo/wrapper resend -h caro-coops.org -l hypoxia hypoxia-outgoing",hypoxia-archive
hypoxia-archive: "|/usr2/home/majordomo/wrapper archive2.pl -f /usr2/home/majordomo/lists/hypoxia.archive/hypoxia -m -a"
hypoxia-outgoing: :include: /usr2/home/majordomo/lists/hypoxia
owner-hypoxia: pseal
hypoxia-request: "|/usr2/home/majordomo/wrapper request-answer hypoxia"
hypoxia-approval: pseal
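Since every new list needs the same six alias lines with only the list name and owner changed, they can be generated instead of hand-edited. A minimal sketch; the function name majordomo_aliases is hypothetical, and the default paths and host are simply the ones used in the 'hypoxia' entries above.

```shell
# Hypothetical helper: majordomo_aliases LIST OWNER [MAJORDOMO_HOME] [HOST]
# Prints /etc/aliases entries following the 'hypoxia' pattern above.
majordomo_aliases() {
  local list="$1" owner="$2"
  local home="${3:-/usr2/home/majordomo}" host="${4:-caro-coops.org}"
  cat <<EOF
$list: "|$home/wrapper resend -h $host -l $list $list-outgoing",$list-archive
$list-archive: "|$home/wrapper archive2.pl -f $home/lists/$list.archive/$list -m -a"
$list-outgoing: :include: $home/lists/$list
owner-$list: $owner
$list-request: "|$home/wrapper request-answer $list"
$list-approval: $owner
EOF
}
```

Review the output of majordomo_aliases hypoxia pseal first; once it looks right, append it with majordomo_aliases hypoxia pseal >> /etc/aliases and run newaliases.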
Rebuild the aliases database so the new aliases are active:
[root]# newaliases
Under /var/www/cgi-bin/.wilma, copy the 'nodes' .cf and .rc files to 'hypoxia' versions and replace the 'nodes' references using the same perl search and replace:
-rw-r--r-- 1 root root 3954 Jul 8 17:16 hypoxia.cf
-rw-r--r-- 1 root root 2862 Jul 8 17:16 hypoxia.rc
[root]# perl -pi -e 's/nodes/hypoxia/g' hypox*
Create the archive directory /var/www/html/misc/pub_listserv_archives/hypoxia.
From /var/www/cgi-bin, run
[root]# perl wilma_reindex hypoxia
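The wilma steps (copy and rename the .cf/.rc files, create the public archive directory) can be sketched the same way. A minimal sketch; the function name clone_wilma is hypothetical, the defaults are the paths given above, and the reindex itself is left as the manual command shown.

```shell
# Hypothetical helper: clone_wilma SRC NEW [WILMA_DIR] [ARCHIVE_ROOT]
# Copies SRC.cf/SRC.rc to NEW.cf/NEW.rc with SRC -> NEW substituted inside,
# then creates the public archive directory for NEW.
clone_wilma() {
  local src="$1" new="$2"
  local wilma="${3:-/var/www/cgi-bin/.wilma}"
  local archives="${4:-/var/www/html/misc/pub_listserv_archives}"
  local ext
  for ext in cf rc; do
    # cp -p keeps the mode; sed then overwrites the content in place
    cp -p "$wilma/$src.$ext" "$wilma/$new.$ext" || return 1
    sed "s/$src/$new/g" "$wilma/$src.$ext" > "$wilma/$new.$ext"
  done
  mkdir -p "$archives/$new"
}
```

After clone_wilma nodes hypoxia, the reindex is still run from /var/www/cgi-bin with perl wilma_reindex hypoxia.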
You should now be able to send a test message to the hypoxia listserv, receive the post, and find it in the listserv archives at
http://caro-coops.org/cgi-bin/wilma/hypoxia
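Besides the web view, the test post can be checked in the majordomo archive file that the LIST-archive alias writes. A minimal sketch; the function name test_post_archived is hypothetical, and it assumes archive2.pl -m names monthly files LIST.YYMM, as the hypoxia.0507 file in the listing above suggests.

```shell
# Hypothetical helper: test_post_archived LIST SUBJECT [LISTS_DIR] [YYMM]
# Succeeds if a Subject: line containing SUBJECT is in the monthly archive file.
# Assumes archive2.pl -m names monthly files LIST.YYMM (e.g. hypoxia.0507).
test_post_archived() {
  local list="$1" subject="$2"
  local dir="${3:-/usr2/home/majordomo/lists}" yymm="${4:-$(date +%y%m)}"
  local file="$dir/$list.archive/$list.$yymm"
  [ -f "$file" ] || return 1
  # grep -F matches SUBJECT literally, even if a subject prefix was added
  grep '^Subject:' "$file" | grep -qF -- "$subject"
}
```

For example, after sending a message with the subject "hypoxia test 1" (assuming the list address is hypoxia@caro-coops.org), test_post_archived hypoxia "hypoxia test 1" && echo archived.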
Making a listserv private
If the listserv policy is to be private:
- change the subscribe policy to 'closed' in the listserv's .config file (for example, /usr2/home/majordomo/lists/hypoxia.config):
subscribe_policy = closed
- change the other policy settings in the .config file that are set to 'open' to 'list'
- remove the listserv's archive listing files (.cf and .rc) under /var/www/cgi-bin/.wilma
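The two .config edits above can be sketched with sed. This is a minimal sketch under an assumption: it treats the "policy settings set to 'open'" as the *_access settings (get_access, who_access and so on), and leaves unsubscribe_policy alone because its values are of the open/closed/auto kind rather than 'list'. The function name make_list_private is hypothetical, and the .wilma file removal stays manual.

```shell
# Hypothetical helper: make_list_private CONFIG_FILE
# Keeps a CONFIG_FILE.bak copy, sets subscribe_policy to 'closed' and changes
# any *_access setting whose value is exactly 'open' to 'list'.
make_list_private() {
  local cfg="$1"
  cp -p "$cfg" "$cfg.bak" || return 1
  sed -e 's/^\([[:space:]]*subscribe_policy[[:space:]]*=[[:space:]]*\).*/\1closed/' \
      -e 's/^\([[:space:]]*[a-z_]*_access[[:space:]]*=[[:space:]]*\)open[[:space:]]*$/\1list/' \
      "$cfg.bak" > "$cfg"
}
```

For example, make_list_private /usr2/home/majordomo/lists/hypoxia.config, then diff the result against hypoxia.config.bak before removing the .cf and .rc files.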
Cron schedule for removing old files
Substitute the cron schedule, find expression and number of minutes as needed. The entry below runs daily at 04:01.
# Remove NDBC data files not accessed in the last 24 hours (1440 minutes)
1 4 * * * find /usr2/prod/buoys/noaa/perl/data -maxdepth 1 -amin +1440 -exec rm -f {} \;
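The same cleanup can be tried by hand before it goes into the crontab. In this minimal sketch (the function name prune_old_files is hypothetical) two things differ from the entry above, and both are assumptions about intent: -mmin ages files by last modification rather than last access (-amin), and -type f keeps find from ever handing the data directory itself to rm.

```shell
# Hypothetical helper: prune_old_files DIR MINUTES
# Deletes regular files directly inside DIR (no subdirectories) that were last
# modified more than MINUTES minutes ago.
prune_old_files() {
  local dir="$1" minutes="$2"
  find "$dir" -maxdepth 1 -type f -mmin +"$minutes" -exec rm -f {} +
}
```

To preview what would be deleted, replace -exec rm -f {} + with -print.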
Server/disk failure notes
When nemo goes down, remember to watch nautilus and the other systems for new or dead processes and for locks that get hung. Check system loads with the 'top' command.
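For a non-interactive load check (from cron, or in an ssh loop over the other servers), the 1-minute load average can be read from /proc/loadavg on Linux instead of watching top. A minimal sketch; the function name load_above and any threshold you pick are hypothetical.

```shell
# Hypothetical helper: load_above THRESHOLD [LOADAVG_FILE]
# Prints the 1-minute load average and exits 0 only if it exceeds THRESHOLD.
# Reads Linux's /proc/loadavg by default.
load_above() {
  local threshold="$1" loadfile="${2:-/proc/loadavg}"
  awk -v t="$threshold" '{ print $1; exit !(($1 + 0) > (t + 0)) }' "$loadfile"
}
```

For example, load_above 8 && echo "load is high on $(hostname)".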
It is OK to hot swap a bay drive if its light is amber.
##begin
#these steps are usually unnecessary and run the danger of accidentally reconfiguring the system
#try a regular reboot first to see what happens
#a normal reboot takes around 10-20 minutes; if the hard drives are completely rebuilt, it can take several hours
Press Ctrl-A during bootup to get to SCSI management.
In Disk Utilities, verify, say, 5% of each disk.
Look back at the containers to see if any members are missing; the system should be scrubbing (rebuilding).
Feb 29, 2008 (2 amber lights in bays 1 and 4; /usr2 (410 GB) is dead)
Press Ctrl-A during bootup to get to SCSI management.
Go to 'Manage Containers'.
Select the 'dead' container.
Press Ctrl-R to restore. You will get an ominous message about losing data, but once chosen, the container starts scrubbing (restoring) and the missing partitions should reappear. The restore may take several hours.
Also note that the keyboard stopped responding for some reason; unplugging and replugging the KVM switch fixed it.
##end
Server stats (for checking conditions prior to a failure) are located on nemo under
/usr2/home/jcothran/public_html/stats
A subdirectory called 'hung' has files/stats relating to earlier hangs.
Links of interest
Google search terms:
dell poweredge disk fail
dell poweredge amber rebuild
remote directory mounting - possible source of crashes?
http://www.ma.utexas.edu/users/stirling/computergeek/lufs.html
http://mss.net/Links/DellDocs/PowerEdge2650/InstallTroublesh/5g375c50.htm
The link below is very similar to my support experience: once you get to an OS prompt (i.e. it is not a hardware issue), you're on your own.
http://lists.us.dell.com/pipermail/linux-poweredge/2004-April/013950.html
July 14, 2005
Nemo is hung: no ssh, and the head monitor at the server is blank.
Bay #5 was flashing amber, indicating a possible problem with the drive, bay or controller. Hot swapped the drive in the amber bay with a spare drive of the same type. Waited 20 minutes to see if the system would restore the console, but the monitor stayed blank. After a power off/on, the system came back up with a 'rebuilding' status scrolling by as it referenced the partitions. The system seems to be working fine.
Restoring remote mounts after reboot
If all power to the facilities is lost and restored, or in any other scenario where all the servers are powered off and on, make sure that the remote mounts to the other servers and the Samba mount points (external hard drives) are restored.
After all the servers and workstations are back up (and the remote drives are available), as root run
mount -a
on each server to reload the mounts listed in /etc/fstab.
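To confirm that mount -a picked everything up, the mount points in /etc/fstab can be compared against what is actually mounted. A minimal sketch for Linux; the function name missing_mounts is hypothetical, and it skips comments, swap and noauto entries because mount -a does not mount those.

```shell
# Hypothetical helper: missing_mounts [FSTAB] [MOUNTS]
# Prints each fstab mount point that does not appear in the current mount table.
missing_mounts() {
  local fstab="${1:-/etc/fstab}" mounts="${2:-/proc/mounts}"
  awk '
    FILENAME == ARGV[1] { mounted[$2] = 1; next }         # first file: current mounts
    /^[ \t]*#/ || NF < 2 { next }                          # skip comments and blank lines
    $2 == "swap" || $2 == "none" || $3 == "swap" { next }  # swap has no mount point
    $4 ~ /(^|,)noauto(,|$)/ { next }                       # mount -a ignores noauto
    !($2 in mounted) { print $2 }
  ' "$mounts" "$fstab"
}
```

No output after mount -a means every auto-mounted fstab entry is in place.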
Samba remote mounts used on nautilus, nemo and trident need a different step. Locate the rc.local file, which should be at /etc/rc.local.
As root, copy and run the Samba commands found in rc.local, like the ones below:
/usr/sbin/smbmount //blazer.asg.sc.edu/blazer_backup /blazer_backup -o username=backup,password=b@ckup,workgroup=BLAZER
/usr/sbin/smbmount //accord.asg.sc.edu/accord_backup /accord_backup -o username=backup,password=b@ckup,workgroup=ACCORD
Running these commands should restore the Samba mounts.
Important note: be sure to check for and remove any files that were incorrectly copied into the placeholder Samba directories, such as /blazer_backup or /accord_backup, while the remote mounts were unavailable. If left for too long, these files consume the available server space and can bring the server to a halt.
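Files written into a placeholder directory are hidden, but still use local disk space, once a share is mounted over that directory, so the check is easiest to do before running smbmount. A minimal sketch for Linux; the function name stray_files is hypothetical.

```shell
# Hypothetical helper: stray_files DIR [MOUNTS]
# If DIR is not currently a mount point, lists everything inside it (these are
# local files that would be hidden by the share). Prints nothing if DIR is mounted.
stray_files() {
  local dir="$1" mounts="${2:-/proc/mounts}"
  if awk -v d="$dir" '$2 == d { found = 1 } END { exit !found }' "$mounts"; then
    return 0   # share is mounted: the contents are remote, nothing to report
  fi
  find "$dir" -mindepth 1 -print
}
```

For example, run stray_files /blazer_backup and stray_files /accord_backup before the smbmount commands; anything listed should be moved or deleted (du -sh on the directory shows how much space it is taking).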
--
JeremyCothran - 08 Jul 2005