1.1 The Long-Range Identification and Tracking
(LRIT) system, which provides for the global identification and tracking
of ships, consists of the shipborne LRIT information transmitting
equipment, the Communication Service Provider(s) (CSPs), the Application
Service Provider(s) (ASPs), the LRIT Data Centre(s) (DCs), including
any related Vessel Monitoring System(s) (VMSs), the LRIT Data Distribution
Plan (DDP), and the International LRIT Data Exchange (IDE). For the
LRIT system to operate efficiently, all components of the LRIT system
need to work seamlessly together to ensure the end-to-end transmission
of messages between DCs requesting and providing LRIT information.
1.2 The provisions of SOLAS regulation V/19-1, the Revised
performance standards and functional requirements for the long-range
identification and tracking (LRIT) of ships (the Revised
performance standards), adopted by resolution
MSC.263(84), as amended, and the Technical specifications for
the LRIT system (MSC.1/Circ.1259/Rev.6) include a number of performance
expectations of system components and thus of the LRIT system as a
whole.
1.3 LRIT information is provided to SOLAS Contracting
Governments and search and rescue (SAR) services entitled to receive
the information, upon request, through a system of National, Regional,
Cooperative and International DCs, applying applicable elements from
the DDP provided by the DDP server and using the IDE to route all
messages between DCs. Individual DCs, the DDP server and the IDE are
key interdependent system components that need to be continuously
maintained in order to meet the expectations of SOLAS Contracting
Governments and SAR services to receive prompt and reliable LRIT information.
1.4 While DCs, the IDE and the DDP server have been designed to ensure that SOLAS Contracting Governments and SAR services are provided, in a timely manner, with the LRIT information they are entitled to receive upon request or as a result of standing orders, it is recognized that from time to time these system components may need temporarily to suspend their operations or to reduce the level of service provided in order to, inter alia, carry out scheduled or unscheduled maintenance or upgrades of the hardware or software in use; manage or control unforeseen events such as malicious network attacks; deal with external causes such as the unavailability of, or lack of access to, telecommunication networks or the internet; or conduct emergency or urgent repairs or maintenance which cannot be deferred to a later time.
1.5 The procedures for the notification, reporting and recording of temporary suspensions of operations of, or reduction of the service provided by, components of the LRIT system (the procedures for temporary suspension of operations or reduction of the service provided), set out in annex 2 to the annex to MSC.1/Circ.1294/Rev.4, describe the steps to be followed by DCs, the IDE and the DDP server when providing salient information to the other components of the LRIT system and to the LRIT Coordinator in cases where they have to temporarily suspend operations or reduce the level of service provided, whether for scheduled or planned activities or as a result of unforeseen events. These procedures also set out the records to be kept in such circumstances and their availability.
1.6 The procedures for temporary suspension of
operations, or reduction of the service provided, are the first steps
in building a more comprehensive Continuity of service plan for the
LRIT system (the Continuity of service plan). Continuity management
is the process by which plans are put in place and managed to ensure
that information technology systems, such as LRIT, can recover and
resume normal operations after a temporary suspension of operations
or a reduction of the service provided, as well as in the event of
a serious disaster. It is not just about reactive measures, but also about preventive measures that reduce the risk of downtime and disaster in the first instance.
1.7 The LRIT system presents particular challenges
as it is an interdependent and international system. The IDE, the
DDP server and all DC operators must work collaboratively to ensure
the continuing smooth operation of the LRIT system on a day-to-day
basis, which in the event of a disaster or other unforeseen event
may necessitate making major operational decisions within a very short
time frame. A Continuity of service plan provides the globally agreed
framework within which those decisions should be taken.
1.8 Incident management, which is primarily concerned
with resolving the situation and getting the system back up and running,
is only one element of a Continuity of service plan. The Continuity
of service plan must also address problem management, which focuses
on determining the root cause of an event and interfaces with change management to ensure that the problem does not recur.
1.9 A change management plan for matters related
to the LRIT system is therefore an important component of the Continuity
of service plan. One of the critical issues that needs to be agreed
relates to the concept of a Change Control Board and overall ongoing
governance of the LRIT system. This plan addresses elements to be
considered in such a Board without presuming to prescribe its composition.
2
Temporary suspension versus
disaster recovery
2.1 Interruptions to the continuity of service of the LRIT system could occur as a result of a planned or unplanned temporary suspension of operations, or reduction of the service provided, by any system component, as well as from a full-scale disaster resulting in a critical failure that necessitates a comprehensive disaster recovery plan and corresponding procedures.
2.2 The Continuity of service plan contains processes and procedures to address both the more routine temporary suspensions and the measures to be taken in the event of a critical failure. While such a plan must look at the system as a whole, given that the LRIT system comprises three types of major system component (the IDE, the DDP server and the individual DCs), it should outline measures to be taken in the event of, firstly, a temporary suspension or reduction of the service provided by each of these individual components and, secondly, a disaster that results in a critical failure of each component.
2.3 The IDE is a message handling service that
facilitates the exchange of LRIT information amongst DCs to enable
LRIT Data Users to obtain the LRIT information they are entitled to
receive. The IDE routes LRIT information between DCs using the information
provided in the DDP. Any suspension of operations or reduction of
the service provided by the IDE has direct and immediate implications
across the entire LRIT system. A critical failure of the IDE without
a comprehensive disaster recovery plan would effectively shut down
the LRIT system. There is therefore a requirement for the IDE operator
to make significant and real-time operational decisions 24 hours a
day, 365 days a year.
2.4 The DDP provides operational rules facilitating the exchange of LRIT information between DCs. Unlike the IDE, a transient failure of the DDP server to provide notifications and downloads of the DDP would not necessarily prevent the LRIT system from continuing to function, as messages can continue to be exchanged between DCs via the IDE once the DDP version number checking function has been disabled.
2.5 However, the unavailability of the DDP server could affect particular DCs or the IDE, depending on the timing and requirements of those components for obtaining the latest versions of the DDP, and could therefore have serious ramifications for the normal operation of the LRIT system as a whole.
2.6 Furthermore, for compliance with the provisions
of SOLAS regulation V/19-1, the availability of the DDP server should
be regarded as a priority equal to that of the IDE, in order to ensure
that the system is operating in accordance with the predetermined
rules at all times.
2.7 The Revised performance standards stipulate
that all DCs should establish and continuously maintain systems which
ensure, at all times, that LRIT Data Users are only provided with
the LRIT information they are entitled to receive as specified in
SOLAS regulation V/19-1. In order
to meet these requirements, DCs should have procedures and processes
in place to address planned or unplanned interruptions to their systems.
If a DC is not functioning, or is functioning at reduced capacity,
the impact is felt by every other component of the system that relies
on that DC to provide timely LRIT information. There is, therefore,
an expectation that DCs have a 24-hour point of contact, identified
in the DDP, in the event of an impediment to continuity of service.
3
Temporary suspensions of
operations or reduction of the service provided
Notifications between components of the LRIT system
3.1 All notifications between components of the
LRIT system should be performed using the contact details provided
in the latest available version of the DDP.
3.2 The IDE should provide the necessary functionality
in the IDE administrative interface to perform all notifications and
publish and update advisory notices.
3.3 Access to the IDE administrative interface
should be provided to the persons in charge of the operation of the
IDE, the DDP Server, all DCs, and the LRIT Coordinator, as listed
in the DDP.
3.4 Whenever a new advisory notice is published,
updated or removed, the IDE should automatically advise the persons
in charge of the operation of the IDE, the DDP Server, all DCs and
the LRIT Coordinator, as listed in the DDP.
Scheduled or planned activities requiring temporary
suspension of operations or reduction of the level of service
3.5 System components requiring temporary suspension
of operations or reduction of the level of service due to scheduled
or planned activities should:
.1 publish an advisory notice on the IDE Administrative Interface at least five (5) days prior to the temporary suspension of operations or reduction of the level of service;
.2 confirm the advisory notice no later than 24 hours prior to the scheduled activity; and
.3 remove the advisory notice after resuming normal operation.
3.6 The advisory notice should include information
on the planned or scheduled activities to be conducted; indicate the
dates and times between which the activities would take place; supply
information on the consequences of the activities (for example, the
IDE would not be available to provide services or the DDP server would
be operating at a reduced rate); and advise, if possible, any measures
or arrangements which the other components of the LRIT system may
need to put in place in order to ensure the speedy and efficient
resumption of normal operations or to manage any adverse effects.
If the circumstances warrant, an advisory notice can be published
for a group of DCs provided the person submitting the notification
is authorized to do so, as provided in the DDP.
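By way of illustration, the timing rules in paragraph 3.5 could be checked programmatically along the following lines. This is a minimal Python sketch only; the function names and the notion of an "activity start" value are illustrative assumptions and are not part of the LRIT technical specifications.

from datetime import datetime, timedelta

# Illustrative check of the advisory notice timing rules in paragraph 3.5.
MIN_ADVANCE_NOTICE = timedelta(days=5)     # .1: publish at least 5 days ahead
CONFIRMATION_WINDOW = timedelta(hours=24)  # .2: confirm no later than 24 h ahead

def may_publish_notice(now: datetime, activity_start: datetime) -> bool:
    """True if an advisory notice published now meets the five-day rule."""
    return activity_start - now >= MIN_ADVANCE_NOTICE

def confirmation_deadline(activity_start: datetime) -> datetime:
    """Latest moment at which the advisory notice should be confirmed."""
    return activity_start - CONFIRMATION_WINDOW

For example, for an activity starting at 00:00 on 20 August, the notice would have to be published by 00:00 on 15 August and confirmed by 00:00 on 19 August.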
3.7 Figure 1 illustrates the steps to be taken
when a suspension of operations or reduction of level of service due
to scheduled or planned activities occurs:
Unforeseen events requiring temporary suspension of
operations or reduction of the level of service
3.8 Having identified an issue, the DC concerned,
the IDE or the DDP server, as the case may be, should work collaboratively
to resolve the issue. This may include contacting other components
of the LRIT system using the contact details of the designated points
of contact provided in the DDP.
3.9 Upon recognition or notification of an unforeseen
event requiring temporary suspension of operations or reduction of
the level of service, the system component concerned, the IDE or the
DDP server, as the case may be, should try to resolve the issue and
stabilize the component and, in particular:
.1 publish an advisory notice on the IDE Administrative Interface providing relevant information and including the expected time for resuming normal operation. Such a notice should be updated as and when developments occur;
.2 if, after 24 hours, the issue cannot be resolved, advise the LRIT Operational governance body, identifying the issue along with the measures or actions to be taken; and
.3 once the system component concerned resumes or restores normal operation, remove the advisory notice from the IDE Administrative Interface.
3.10 If the issue is identified by the IDE or
the DDP server, then the system component concerned should be contacted
to resolve the issue. If the system component concerned cannot be
contacted within 24 hours, then the IDE or the DDP server, as the
case may be, should publish an advisory notice on the IDE Administrative
Interface on behalf of the system component concerned.
3.11 Figure 2 illustrates the steps to be taken
when a suspension of operations or reduction of level of service due
to unforeseen events occurs:
Identification of degradation in the level of LRIT
service
3.12 If the IDE, the DDP server or a DC operator encounters a degradation in the level of LRIT service believed to be caused by another component of the LRIT system, then the following actions should be taken:
.1 review known issues posted on the IDE Administrative interface to determine if the issue encountered was already identified by another system component;
.2 if required, use the tools available on the IDE Administrative interface to assist in troubleshooting the issue. This, for example, may include checking the IDE journal for routeing of LRIT messages or other networking functions;
.3 if the issue identified was the result of another LRIT system component, then the system component concerned should be contacted using the contact information available in the DDP; and
.4 if the system component is unable to resolve the issue after directly contacting the system component associated with the problem, or if the system component is unsure of the origins of the issue, and if the issue has reduced the operational capability of the system or is causing the LRIT system to not perform as designed, then the system component should follow the procedures specified in paragraphs 3.8 to 3.10 above.
3.13 In accordance with the Technical specifications for communications within the LRIT system, DCs and the DDP server, as the case may be, should transmit System status messages to the IDE every 30 minutes. These messages are transmitted in order to provide the IDE with information pertaining to the operational status of the system component concerned.
3.14 If the IDE does not receive eight (8) consecutive
System status messages from a specific DC or the DDP server, or if
the IDE cannot successfully send eight (8) consecutive System status
messages to a specific DC or the DDP server due to a problem at the
receiving end, and there has been no scheduled or unscheduled notification
or advisory notice posted on the IDE Administrative interface by the
DC concerned or the DDP server, then the IDE operator should post
an advisory notice to the IDE Administrative interface and follow
the procedures specified in paragraph 3.12 above. Upon notification,
the DC concerned or the DDP server should follow the procedures specified
in paragraph 3.9 above.
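The monitoring rule in paragraphs 3.13 and 3.14 amounts to detecting roughly four hours of silence (eight missed messages at 30-minute intervals). The following is a minimal Python sketch of such a check; all names are illustrative assumptions and the real IDE implementation is not shown.

from datetime import datetime, timedelta

# Illustrative detection of missed System status messages (paragraphs 3.13-3.14).
STATUS_INTERVAL = timedelta(minutes=30)  # expected reporting interval
MISSED_THRESHOLD = 8                     # consecutive missed System status messages

def missed_messages(last_received: datetime, now: datetime) -> int:
    """Number of expected System status messages not yet received."""
    return max(0, int((now - last_received) / STATUS_INTERVAL))

def should_post_advisory(last_received: datetime, now: datetime,
                         notice_already_posted: bool) -> bool:
    """True if the IDE operator should post an advisory notice."""
    return (not notice_already_posted
            and missed_messages(last_received, now) >= MISSED_THRESHOLD)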
Issues related to the DDP version number checking
function
3.15 In accordance with the Technical specifications
for the International LRIT Data Exchange, the IDE should have the
functional capability to validate the DDP version number contained
in all received LRIT messages against the version number of the latest
available version of the DDP.
3.16 The IDE operator is authorized to disable the DDP version number checking function under circumstances that may cause, or have caused, a significant number of DCs and their associated SOLAS Contracting Government(s) not to be in conformance with the latest available version of the DDP as implemented by the IDE.
3.17 After disabling the DDP version number checking
function, the IDE operator should follow the procedures specified
in paragraph 3.12 above.
3.18 Once the issue is resolved, the IDE should enable the DDP version number checking function and advise all system components accordingly.
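Functionally, the check described in paragraphs 3.15 and 3.16 can be thought of as a comparison against the version number of the latest available DDP, with an operator-controlled switch to disable it. The following Python sketch is illustrative only; the class and attribute names are assumptions.

# Illustrative model of the DDP version number check (paragraphs 3.15-3.16).
class DdpVersionCheck:
    def __init__(self, latest_ddp_version: str):
        self.latest_ddp_version = latest_ddp_version
        self.enabled = True  # may be disabled by the IDE operator (paragraph 3.16)

    def message_is_acceptable(self, message_ddp_version: str) -> bool:
        """True if a received LRIT message passes the version check."""
        if not self.enabled:
            return True  # checking disabled: accept regardless of version
        return message_ddp_version == self.latest_ddp_version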
Invalid DDP upload (malicious or inadvertent)
3.19 Cases where the DDP file provided by the
DDP server is invalid or cannot be properly processed may be separated
into two categories:
.1 DDP content improperly formed (i.e. inverted polygons or other data contained within the DDP, where the DDP remains valid as per the XML schema); and
.2 a DDP file which is corrupted or otherwise invalid with regard to the XML schema.
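The second category can be detected automatically by validating the received file against the DDP XML schema, whereas the first requires content-level checks beyond schema validation. The following Python sketch (using the lxml library) is illustrative only; the file names are placeholders.

from lxml import etree

# Illustrative classification of an invalid DDP upload (paragraph 3.19).
# "ddp.xml" and "ddp.xsd" are placeholder file names; category .1 issues
# (schema-valid but improperly formed content) need further content checks.
def classify_ddp_file(ddp_path: str, schema_path: str) -> str:
    schema = etree.XMLSchema(etree.parse(schema_path))
    try:
        document = etree.parse(ddp_path)
    except etree.XMLSyntaxError:
        return "category .2: file corrupted or not well-formed XML"
    if not schema.validate(document):
        return "category .2: well-formed XML but invalid against the schema"
    return "schema-valid: content-level (category .1) checks still required"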
3.20 In addition to the DDP processing procedures
specified in sections 2.3.2 and 2.3.2A of the Technical specifications
for communications within the LRIT system and in paragraph 3.12 above,
the DDP server operator, after being notified of an issue, should
take the following actions:
.1 analyse the reported problem and verify the issue. If required, the DDP server operator should request the IDE to disable the DDP version number checking function;
.2 advise all DCs, the IDE and the LRIT Coordinator about the issue;
.3 take all necessary actions to return all affected DDP versions to a valid state, including contacting the designated national points of contact for LRIT-related matters of the SOLAS Contracting Government(s) concerned, or removing or modifying data associated with the problem;
.4 contact the IDE and confirm that the issue has been resolved; and
.5 restore normal operation and notify all DCs, the IDE and the LRIT Coordinator, specifying any necessary actions to be observed or executed.
3.21 The Secretariat should report to the Maritime Safety Committee on any issue(s) with the DDP, as well as on any subsequent action(s) taken.
PKI certificate compromise
3.22 The Organization, acting as PKI Certificate
Authority (CA), issues PKI certificates for the testing and production
environments of the LRIT system for use by DCs, the IDE and the DDP
server in relation to communications within the LRIT system.
3.23 If a system component identifies an issue
which may compromise the security of a PKI certificate, then the CA,
after being notified of an issue, should take the following actions:
.1 as soon as a breach in security related to an issued PKI certificate(s) is discovered, the CA should notify the IDE and the DDP server operators. The IDE and DDP server operators should take immediate action to disable all communications using the compromised PKI certificate(s);
.2 revoke, in due course, the compromised PKI certificate(s) and publish an updated Certificate Revocation List. If necessary, the CA should contact the person in charge of the affected component for further information on the issue. The affected system component may submit a request for the issue of a new PKI certificate to the CA in accordance with the procedures issued by the Organization; and
.3 issue a new PKI certificate(s) for the affected system component to resume normal operation.
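For illustration, a system component wishing to confirm whether a peer certificate appears on the published Certificate Revocation List could perform a check along the following lines (Python, using the cryptography library). How the certificate and CRL files are obtained is an assumption here; the actual CRL distribution mechanism is that defined by the Organization.

from cryptography import x509

# Illustrative check of a certificate against the CRL (paragraph 3.23.2).
def certificate_is_revoked(cert_pem: bytes, crl_pem: bytes) -> bool:
    certificate = x509.load_pem_x509_certificate(cert_pem)
    crl = x509.load_pem_x509_crl(crl_pem)
    revoked = crl.get_revoked_certificate_by_serial_number(certificate.serial_number)
    return revoked is not None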
3.24 Any notification about a PKI compromise should originate from the person in charge of the DC, the IDE or the DDP server, as the case may be, or from a designated national point of contact for LRIT-related matters of a SOLAS Contracting Government.
3.25 The system component affected should also
follow the procedures specified in paragraph 3.9 above.
3.26 The Secretariat should report to the Maritime Safety Committee on any issue with PKI certificates, as well as on any subsequent action(s) taken.
PKI Changeover procedures
3.27 The following procedures should be observed
during the PKI changeover:
.1 all PKI certificates should expire on the same date;
.2 the CA should be available before, during and after the time of changeover;
.3 the PKI changeover date should be, at minimum, two (2) weeks prior to expiration of the PKI certificates;
.4 new PKI certificates should be distributed at least two (2) weeks prior to the changeover date; and
.5 requests for the issue of PKI certificates should be submitted no less than six (6) weeks prior to the PKI changeover date.
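For illustration, the deadlines implied by the rules in paragraph 3.27 can be derived from the common certificate expiry date as in the following Python sketch; the example expiry date shown is hypothetical.

from datetime import date, timedelta

# Illustrative derivation of the PKI changeover deadlines (paragraph 3.27).
def changeover_schedule(common_expiry: date) -> dict:
    changeover = common_expiry - timedelta(weeks=2)  # rule .3
    return {
        "PKI changeover date": changeover,
        "new certificates distributed by": changeover - timedelta(weeks=2),  # rule .4
        "certificate requests submitted by": changeover - timedelta(weeks=6),  # rule .5
    }

For example, changeover_schedule(date(2026, 6, 30)) gives a changeover on 2026-06-16, distribution of new certificates by 2026-06-02 and submission of requests by 2026-05-05.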
4.1
IDE Disaster Recovery
Critical failure circumstances
4.1.1 A critical failure circumstance could take place if the IDE sustains a critical failure (e.g. a sustained power outage, sustained network connectivity degradation, etc.) at its host site, cannot be reconstituted on hardware at the local host site and therefore must fail over to hardware at the IDE Disaster Recovery (DR) site.
4.1.2 It is expected that an IDE DR capability
would be provided for the IDE by either the primary IDE Operator or
another entity.
IDE DR planning considerations
4.1.3 In accordance with the Technical specifications for the LRIT system, the IDE should have a DR site accessible 24 hours a day, every day of the year.
4.1.4 The IDE DR site should have:
.1 full operational functionality, except for partial access to the IDE Journal during the DR period;
.2 off-site storage of both full and incremental backups, including backups of the journal; and
.3 data and PKI synchronization with the production environment of the LRIT system at a minimum every six (6) hours. The IDE should only be offline for a maximum period of four (4) hours. With the synchronization set to six (6) hours, there is therefore a risk of losing up to 10 hours of journal information for the IDE (up to six (6) hours of data since the last synchronization, plus up to four (4) hours of downtime).
4.1.5 The IDE operator should be cognizant of
firewall restrictions at the DR site and should ensure there are no
restrictions at the DR site on the IP addresses accessing the production
system.
4.1.6 To institute a failover to the DR site, a Domain Name System (DNS) change is required. Most systems should be set up to refresh automatically within 15 minutes. The DNS record for the IDE should be set up to expire and refresh every 10 minutes. However, if this switch does not happen automatically, then some system components may need to be rebooted to institute the change. Upon refresh or reboot, all system components should be operational.
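For illustration, a DC or other system component could verify that the IDE hostname is served with a TTL short enough for the 10-minute refresh described in paragraph 4.1.6 along the following lines (Python, using the dnspython library); the hostname shown is a placeholder, not the actual IDE address.

import dns.resolver  # dnspython

# Illustrative TTL check supporting the DNS-based failover of paragraph 4.1.6.
def ttl_is_short_enough(hostname: str, max_ttl_seconds: int = 600) -> bool:
    answer = dns.resolver.resolve(hostname, "A")
    return answer.rrset.ttl <= max_ttl_seconds

# e.g. ttl_is_short_enough("ide.example.org")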
4.1.7 While the IDE is failing over to the IDE
DR site, the DDP version number checking function should be disabled
until the IDE operator determines that the system is stable.
4.1.8 The IDE DR should be tested once a year in the production environment and as determined by the IDE operator. The IDE should follow the notification procedures identified in the procedures for temporary suspension of operations and reduction of level of service. The switchover to the IDE DR site in production should be communicated in advance to the LRIT Operational governance body. Critical success factors for the planned test should also be communicated via the notification process.
IDE DR management considerations
4.1.9 The IDE should be switched to the IDE DR site if the IDE operator estimates that the downtime to fix an unplanned outage could exceed two (2) hours. The changeover itself can take up to two (2) hours. This provides for up to four (4) hours of service unavailability in the event of a critical failure of the IDE at its primary site, by which time normal service should have been resumed through the IDE DR site.
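The decision rule in paragraph 4.1.9 can be summarized in a short illustrative sketch; the constant and function names are assumptions.

from datetime import timedelta

# Illustrative expression of the IDE failover decision rule (paragraph 4.1.9).
FIX_TIME_THRESHOLD = timedelta(hours=2)  # switch to DR if the fix would exceed this
CHANGEOVER_BUDGET = timedelta(hours=2)   # the changeover itself may take this long
MAX_UNAVAILABILITY = FIX_TIME_THRESHOLD + CHANGEOVER_BUDGET  # four (4) hours

def should_switch_to_dr_site(estimated_fix_time: timedelta) -> bool:
    """True if the IDE operator should initiate the failover."""
    return estimated_fix_time > FIX_TIME_THRESHOLD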
4.1.10 Upon activation of the IDE DR process, the IDE operator should advise all DCs, the DDP server and the LRIT Coordinator that the IDE DR will be activated. If for any reason the IDE cannot perform the communication, then the IDE operator should contact the DDP server operator and request it to perform the communication.
4.1.11 If the IDE DR site operator notes that
three (3) or more System status messages from the IDE have been missed
and there has been no scheduled or unscheduled notification or advisory
notice posted on the IDE Administrative interface, then the IDE DR
site operator should attempt to contact the IDE operator to determine
the nature of the problem. If, within 30 minutes, the IDE DR site operator
is unable to contact the IDE, then the IDE DR site should advise all
DCs and the LRIT Coordinator that there is a problem with the IDE
and that the process for a failover to the IDE DR site is being activated.
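The watchdog behaviour described in paragraph 4.1.11 can be summarized as a simple decision function; this is an illustrative sketch only, and the names and state handling are assumptions.

from datetime import datetime, timedelta
from typing import Optional

# Illustrative decision logic for the IDE DR site operator (paragraph 4.1.11).
MISSED_THRESHOLD = 3                    # missed System status messages from the IDE
CONTACT_WINDOW = timedelta(minutes=30)  # time allowed to reach the IDE operator

def dr_site_action(missed: int, advisory_posted: bool,
                   contact_started: Optional[datetime], now: datetime) -> str:
    if missed < MISSED_THRESHOLD or advisory_posted:
        return "no action required"
    if contact_started is None:
        return "attempt to contact the IDE operator"
    if now - contact_started >= CONTACT_WINDOW:
        return "advise all DCs and the LRIT Coordinator; activate failover"
    return "await a response from the IDE operator"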
4.1.12 Once the IDE DR site is activated, the IDE should advise all DCs, the DDP server and the LRIT Coordinator that the IDE DR operation is ready and commencing, that the IDE DR plan is now in place and that the instructions previously agreed upon and documented should be implemented.
4.1.13 The IDE should remain at the IDE DR site
as long as necessary and until the IDE operator determines that the
primary site is ready for a return to normal operations. As soon as
the primary location is ready, the IDE operator should advise all
DCs, the DDP server and the LRIT Coordinator at least 24 hours prior
to the return to the primary location.
4.1.14 Upon recovery to the primary location,
the IDE operator should complete a report as required in the procedures
for temporary suspension of operations and reduction of level of service.
IDE DR dependencies
4.1.15 Full 24/7 support and operation of the DDP server, to allow the endpoint for PKI to be updated and to support the notification process, if necessary.
4.1.16 Synchronization with the production environment of the LRIT system (data, PKI certificates).
4.2
DDP server Disaster Recovery
Critical failure circumstances
4.2.1 A critical failure circumstance could take place if the DDP server sustains a critical failure preventing its normal operation within the LRIT system (e.g. a sustained power outage, sustained network connectivity degradation, etc.) at its host site, cannot be reconstituted on hardware at the local host site and therefore must fail over to hardware at the DDP server DR site.
4.2.2 It is expected that a DR capability, including
a 24-hour monitoring of the operational system for issue resolution
and the handling of DDP server DR failover, will be provided by the
Organization.
DDP server DR site planning considerations
4.2.3 In accordance with the Technical specifications for the LRIT system, the DDP server should have a DR site accessible 24 hours a day, every day of the year.
4.2.4 During an unplanned outage, the DDP server operator shall have up to two (2) hours to resolve the issue and restore DDP server functionality. If the outage is estimated from the outset to require more than two (2) hours to resolve, or if after two (2) hours the service cannot be restored, the transition process to the DDP server DR site should be initiated. The transition process may take up to two (2) hours to be completed. This provides for up to four (4) hours of service unavailability in the event of a critical failure of the DDP server at its primary site, by which time normal service should have been resumed through the DDP server DR site.
DR infrastructure considerations
4.2.5 The DDP server system hosted at the DR site
should have full operational functionality, providing all services
as on the primary site during normal operation. The DDP server DR
site should be maintained on an ongoing basis and be kept synchronized
with the DDP server system at the primary site, in order to facilitate
an emergency failover at any time.
4.2.6 In order to keep technical complexities
within reasonable limits, the DDP server DR site may lag up to six
(6) hours behind the DDP server at the primary site during normal
operation. As a consequence, up to six (6) hours of system data may
be irrecoverably lost should the DR plan be activated.
4.2.7 The transition to the DDP server DR site
during a failover exercise should be as seamless as possible to minimize
the impact on the LRIT system. The DNS entry of the DDP server should
be set up to expire and refresh every 10 minutes to reflect its IP
address at the DDP server DR site. This approach avoids the need to
change the DDP server's web service URI and therefore the requirement
for having a separate PKI certificate for the DDP server DR site.
The IP address of the DDP server DR site should be communicated well
in advance to all LRIT system components to enable firewalls and other
routing devices to permit normal communications with the DDP server
at its DR location.
4.2.8 The DDP server should participate in, and
execute, planned DR failover tests of the LRIT system together with
all other components, in accordance with the procedures adopted for
such testing.
4.2.9 It is noted that the DDP server is implemented
as a module of the GISIS system, and all provisions for the DR, and
downtime related to the DR testing, would apply to the GISIS system
as a whole, including the accessibility of all modules by Member States
and members of the public.
4.2.10 Upon activation of the DDP server DR process, the DDP server operator should advise all DCs, the IDE and the LRIT Coordinator that the DDP server DR will be activated. If for any reason the DDP server cannot perform the communication, then the DDP server operator should contact the IDE operator and request it to perform the communication. If required, the DDP server operator should request the IDE to disable the DDP version number checking function.
4.2.11 If the IDE operator notes that three (3)
or more System status messages from the DDP server have been missed
and there has been no scheduled or unscheduled notification or advisory
notice posted on the IDE Administrative interface, then the IDE operator
should attempt to contact the DDP server operator to determine the
nature of the problem. If, within 30 minutes, the IDE operator is unable
to contact the DDP server, then the IDE should advise all DCs and
the LRIT Coordinator that there is a problem with the DDP server and
that the process for a failover to the DDP server DR site could be
activated.
4.2.12 Once the DDP server DR site is activated, the DDP server operator should advise all DCs, the IDE and the LRIT Coordinator that the DDP server DR operation is ready and commencing, that the DDP server DR plan is now in place and that the instructions previously agreed upon and documented should be implemented.
4.2.13 The DDP server operator should also contact the IDE and confirm that the DDP version numbers are in sequence. If, after the re-establishment of service at the DDP server DR site, the DDP versions held by the IDE and/or DCs are no longer synchronized with the latest DDP version published by the DDP server, then the DDP server operator should take the necessary action to publish a new version of the DDP at an appropriate version number to ensure that all components are able to retrieve and consistently apply the new version of the DDP. During this time, the DDP version number checking function should remain disabled until all DCs and the IDE can implement the current/new version of the DDP.
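The version resynchronization step can be illustrated as follows. This is a sketch only: integer version numbers are a simplification, and actual DDP version identifiers and publication procedures are those set out in the Technical specifications for the LRIT system.

# Illustrative selection of a DDP version number that supersedes anything
# already held by the IDE or the DCs after failover (paragraph 4.2.13).
def next_publishable_version(dr_site_version: int,
                             versions_seen_by_components: list) -> int:
    # Publish strictly above anything any component may already hold so
    # that all components converge on one consistent DDP version.
    return max([dr_site_version] + list(versions_seen_by_components)) + 1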
4.2.14 The DDP server should remain at the DDP
server DR site as long as necessary and until the DDP server operator
determines that the primary site is ready for a return to normal operations.
As soon as the primary location is ready, the DDP server operator
should advise all DCs, the IDE and the LRIT Coordinator at least 24
hours prior to the return to the primary location.
4.2.15 Upon recovery to the primary location,
the DDP server operator should complete a report as required in the
procedures for temporary suspension of operations and reduction of
level of service.
DDP server DR dependencies
4.2.16 Full 24/7 support and operation of the
IDE for supporting the notification process, if necessary.
4.2.17 Synchronization with the production environment
of the LRIT system (data, PKI certificates).