OSG Operational Services

Wednesday, September 24, 2014

GOC Special Maintenance Window - Tuesday, September 30th at 13:00 UTC

The GOC would like to announce a special maintenance window for September 30th, the 5th Tuesday of September.

Beginning at 13:00 UTC (9:00 a.m. EDT), OSG Operations will be performing maintenance on the OSG Twiki and JIRA services, as well as the Planet OSG blog aggregator.  The VM host for these services has been upgraded and they will now be moved back to this upgraded host.  The services will not be down at the same time, and no service should be down for longer than 30 minutes.

Additional maintenance will be performed on machines used internally for the operation of all services. These internal changes should be transparent to users. The GOC reserves 8 hours for these activities.

Tuesday, September 16, 2014

GOC Service Update - Tuesday, September 23rd at 13:00 UTC

The GOC will upgrade the following services beginning Tuesday, September 23rd, 2014 at 13:00 UTC. The GOC reserves 8 hours in the unlikely event that unexpected problems are encountered.

LVS Cluster
Upgrading the LVS instance lvs1 to CentOS 6 during this cycle.

AMQP Cluster (event.grid)
Upgrading RabbitMQ RPMs to the latest version (3.1.5)

OIM 3.36
Add ldapsearch utilities
(Patched) Made it possible to remove mesh configs and tests.
Moved various aux. scripts used by OIM under oim repository.
Fixed an issue where the host certificate expiration notifier was broken due to a missing requester_id. It now uses requester_name instead, which is always set.
OIM cron scripts: added a check of the real service return code to generate a GOC alert in case of server error.
Patched the issue where newly entered meshconfig records go missing. Updated meshconfig admin to allow multiple host groups to be used per member group.
Added new oasis_repo_urls field in VO. Refactored alias editor used by resource form and implemented host editor used by both resource form and new VO/oasis url (OIM-104)

OSG Homepage
Upgrading to Wordpress v4.

GOC Ticket 1.81
Fixed an issue where assignee selector malfunctions sometimes (TICKET-102)
Reorganized application configuration.

MyOSG 2.27
Added oasis_repo_urls indicator (OIM-104)
Reorganized configuration file for eventjs.
Updated mesh config publisher to allow multiple host groups to be used per each member.

Oasis
Created SL6 login node per request (https://ticket.grid.iu.edu/22409)

All Services
Operating system updates; reboots will be required. The usual HA mechanisms will be used, but there will be short outages.

Tuesday, September 9, 2014

Announcing OSG Software versions 3.2.15 and 3.1.39

We are pleased to announce OSG Software versions 3.2.15 and 3.1.39.

OSG 3.2.15 contains:
* XRootD updated to version 4.0.0
* GlideinWMS 3.2.6 (update from version 3.2.5.1)
* HTCondor CE 1.5.1 bug fix for idle jobs put on hold
* HTCondor 8.2.2 in the upcoming repository
* Patch for HTCondor 8.0.7 to fix bug with AWS EC2 integration

OSG 3.2.15 and 3.1.39 contain:
* Update dCache SRM client to version 2.2.27
* minor fixes to the OSG PKI tools
* fix for LSF script in the blahp package

Release notes and pointers to more documentation can be found at:

https://www.opensciencegrid.org/bin/view/Documentation/Release3/Release3215
https://www.opensciencegrid.org/bin/view/Documentation/Release3/Release3139

Need help? Let us know:

https://www.opensciencegrid.org/bin/view/Documentation/Release3/HelpProcedure

We welcome feedback on this release!

OSG Connect Services Restored

At approximately 1:50 pm CDT (18:50 GMT) a switch management card at University of Chicago failed, taking down a portion of the Science DMZ that serves US ATLAS Midwest Tier 2, ATLAS Connect, OSG Connect and other CI Connect services, UChicago ATLAS Tier 3, and a server for the South Pole Telescope. System and network engineers responded immediately. As of 3:00 pm CDT (20:00 GMT), all systems are back online and operating at full speed/capacity. The failed management card is being investigated by UC network staff and the vendor. No further interruptions are expected.

OSG Connect Service Unavailable

At 18:52 GMT today, an outage at the University of Chicago caused all OSG Connect services to become unavailable. This outage is ongoing and system engineers are investigating. We will provide updates as they become available.

Tuesday, September 2, 2014

GOC Service Update - Tuesday, September 9th at 13:00 UTC

The GOC will upgrade the following services beginning Tuesday, September 9th at 13:00 UTC. The GOC reserves 8 hours in the unlikely event that unexpected problems are encountered.

MyOSG 2.26
Modifications to the operations status page (tentative). An internal modification to a display of OSG-COG services status.
Updated the responsiveness of rgstatushistory graphs.
Updated pfmesh to use perfsonar_mas and other meshconfig related tables introduced by OIM 3.35.

GOC Ticket 1.80
Added gsiftp/srm to the list of URL types that are converted to links in the ticket viewer. Added deduplication for email/URL replacement so that repeating the same replacement won't corrupt the page.
Fixed an issue where a filename containing an ampersand (&) would break the attachment interface (TICKET-99)
Updated dlog to use the mysql style static output file.
Fixed an issue where the assignee list in the ticket list was displaying empty lines. Other minor style adjustments: made the "security" label red, etc.
Fixed: ticket exchange foreign ID is not displayed nicely for non-anchor id

LVS
Rebuilding one of the LVS instances (lvs2) to run on CentOS 6. We’d like to make sure lvs2 functions properly before switching lvs1 to CentOS 6 as well, probably during the release after next.

OIM 3.35
Updated DomainNameValidator to prevent entering invalid GridAdmin domain (OSGPKI-89)
Updated CN validation rule to prevent double space (OIM-105)
Added a module which crawls all perfsonar endpoints, collects the available MA endpoints, and stores them in the perfsonar_mas table to be used by the MyOSG mesh config publisher.
Added more debug logging to help troubleshoot the stalling Digicert certificate issuing problem.
Made FOS list accessible by guest as readonly view. Organized primary/secondary VO and project into different columns. Also added CERT only label.
Added a module called wlcgloader which downloads WLCG sites/endpoints from GOCDB and synchronizes them in OIM periodically.
Implemented mesh config admin page.

VM Host Upgrade
Updating vm06.grid.iu.edu, one of our remaining RHEL5/VMware Server 1.10 hosts, to RHEL6/KVM (which most of our VM hosts use)
Converting guest disk images, not rebuilding
Will affect software2, data2, repo2, and rsvprocess2. For software and repo, LVS will shunt traffic to the other instances (software1 and repo1), so users will notice no downtime, only possibly some degradation of service. rsvprocess2 and data2 will be down.
This will leave only two RHEL5/VMware Server 1.10 hosts to upgrade (vm02 and vm04)

Monitor
monitor.grid.iu.edu will be rebuilt with RHEL6 GOC stemcell. monitor.grid.iu.edu is GOC’s internal service used to monitor our services and provide various other internal services like gocbot.

GridFTP-HDFS Corruption Issue Workaround

Some sites in OSG have observed data corruption when transferring files with GridFTP-HDFS. In particular, the problem arises when pthreads is enabled (by setting GLOBUS_THREAD_MODEL="pthread") and GridFTP is using *single stream* transfers that span multiple HDFS blocks. In this condition, blocks may be written to the destination file in the wrong sequence. A few sites using GridFTP-HDFS have reported failures, including Fermilab and GLOW.

This issue affects the OSG gridftp-hdfs package, versions 0.5.4-14 and newer, because they have pthreads enabled by default.
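As a rough aid for checking whether a given server falls in the affected range, the comparison can be sketched in Python. This is a simplified illustration, not a full RPM version comparison: the helper names are hypothetical, and only the leading digits of each field are compared, so a dist tag after the release number is ignored.

```python
import re

# First affected build, as stated in the announcement above.
FIRST_AFFECTED = "0.5.4-14"


def parse_version_release(text):
    """Turn a 'version-release' string into a tuple of integers.

    Simplification (assumption): only the leading digits of each
    dot-separated version field and of the release are used, so
    '0.5.4-14.osg.el6' is treated the same as '0.5.4-14'.
    """
    version, _, release = text.partition("-")
    fields = [int(re.match(r"\d+", part).group()) for part in version.split(".")]
    release_match = re.match(r"\d+", release)
    fields.append(int(release_match.group()) if release_match else 0)
    return tuple(fields)


def is_affected(installed):
    """Return True if the installed gridftp-hdfs build is 0.5.4-14 or newer."""
    return parse_version_release(installed) >= parse_version_release(FIRST_AFFECTED)
```

The installed version string can be obtained with, for example, rpm -q --qf '%{VERSION}-%{RELEASE}' gridftp-hdfs and passed to is_affected().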



DETAILS

Transfers that use parallelism (we tried 2 to 10 streams) and single stream transfers that span only a single HDFS block seem to be fine.

Transfers using a single stream but spanning multiple (3 or more) HDFS blocks result in the correct size, but usually the wrong checksum at the destination. The issue was reported originally by a remote user transferring via srm-copy, and OSG testing has observed the same failures using local globus-url-copy tools.
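Because the corrupted files have the correct size, a size check alone will not catch the problem; a checksum comparison is needed. The following Python sketch shows one way to compare a source file with its transferred copy. The function names are hypothetical, SHA-256 is an arbitrary choice of checksum, and simulate_reordered_copy() only imitates out-of-order blocks for illustration; it is not the actual failure pattern seen in GridFTP-HDFS.

```python
import hashlib
import os


def file_digest(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_copies(source_path, dest_path):
    """Return (same_size, same_checksum) for a source file and its copy."""
    same_size = os.path.getsize(source_path) == os.path.getsize(dest_path)
    same_checksum = file_digest(source_path) == file_digest(dest_path)
    return same_size, same_checksum


def simulate_reordered_copy(data, block_size):
    """Illustration only: swap the second and third blocks of data.

    The result has the same length as the input, but the bytes are in a
    different order, which mimics blocks written in the wrong sequence.
    """
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    if len(blocks) >= 3:
        blocks[1], blocks[2] = blocks[2], blocks[1]
    return b"".join(blocks)
```

A result of (True, False) from compare_copies() matches the symptom described above: the size is right but the content is not.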

The issue has been reported to the Globus GridFTP developers, but there is no fix yet. See below for possible workarounds.



WORKAROUNDS

On the server side, the problem can be avoided by disabling pthreads. To do so, comment out the GLOBUS_THREAD_MODEL line in /etc/gridftp.d/gridftp-hdfs.conf so that it reads:

# $GLOBUS_THREAD_MODEL pthread

Until the bug is fixed, we recommend making this change on all gridftp-hdfs servers 0.5.4-14 and newer.
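For sites that manage many servers, the edit can be scripted. The sketch below is one possible approach in Python, assuming the active line appears exactly as "$GLOBUS_THREAD_MODEL pthread" (apart from surrounding whitespace); the function name is hypothetical. It works on the file contents as a string, so reading and writing the file, and keeping a backup, are left to the caller.

```python
def disable_pthreads(config_text):
    """Return config_text with the active pthread line commented out.

    Lines that are already commented, and all other lines, are left
    unchanged, so applying the function twice gives the same result.
    """
    output = []
    for line in config_text.splitlines(keepends=True):
        if line.split() == ["$GLOBUS_THREAD_MODEL", "pthread"]:
            line = "# " + line.lstrip()
        output.append(line)
    return "".join(output)
```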

Alternatively, if pthreads cannot be disabled for the GridFTP server, it is sufficient on the client side to run globus-url-copy with parallelism greater than 1. For example:

globus-url-copy -p 2 gsiftp://$host:2811/path/file.in file:///path/file.out
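Where transfers are launched from scripts, the client-side workaround can be enforced by building the command in one place. This is a minimal Python sketch; the function name and the example host and paths are hypothetical, and it uses only the -p option shown above.

```python
import shlex


def build_copy_command(source_url, dest_url, streams=2):
    """Build a globus-url-copy argument list that uses parallel streams.

    Refuses a stream count below 2, since single stream transfers are
    the case affected by the corruption issue.
    """
    if streams < 2:
        raise ValueError("use at least 2 streams to avoid single stream transfers")
    return ["globus-url-copy", "-p", str(streams), source_url, dest_url]


def format_command(command):
    """Return the argument list as a shell-quoted string for logging."""
    return " ".join(shlex.quote(part) for part in command)
```

The returned list can be passed to subprocess.run(), for example build_copy_command("gsiftp://gridftp.example.org:2811/path/file.in", "file:///path/file.out").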

FOR MORE INFO

https://globus.atlassian.net/browse/GT-547
https://jira.opensciencegrid.org/browse/SOFTWARE-1495
https://ticket.grid.iu.edu/21825
https://ticket.grid.iu.edu/21157

Gratiaweb unavailable

Gratiaweb (gratiaweb.opensciencegrid.org) is currently unavailable. OSG Operations is aware of the situation and is looking into solutions. We apologize for any inconvenience.