ACIS Anomaly Documentation

These pages detail possible ACIS-related anomalies that may occur. Each page describes the anomaly itself, when it has happened in the past, how likely it is to happen again, and what the response should be. Appropriate links are also given to relevant flight notes, procedures, and other documentation.

Contents:

  • DPA-A Anomalous Shutdown
  • DPA-B Anomalous Shutdown
  • DEA-A Anomalous Shutdown
  • DEA Sequencer Reset During a Science Observation
  • FEP Anomalous Reset
  • Hi/Lo Pixel Anomaly
  • Trickle Bias / T-Plane Latchup Anomaly

DPA-A Anomalous Shutdown

What is it?

The DPA-A shuts down anomalously, presumably due to a spurious command.

Note

The diagnosis and response to this anomaly presented in this document assumes that the active BEP is powered by DPA side A.

When did it happen before?

The DPA-A has shut down 4 times over the mission:

  • October 26, 2000: 2000:300:15:40, obsid 979
  • December 19, 2002: 2002:353:20:26, obsid 60915
  • January 12, 2015: 2015:012:00:01, obsid 52186
  • December 9, 2016: 2016:344:07:40, obsid 17615

Will it happen again?

It appears likely that the anomaly will occur again if the mission continues.

How is this anomaly diagnosed?

Within a major frame (32.2 seconds), one should see:

  • 1DPPSA (DPA-A Power Supply On/Off) change from 1 to 0 (On to Off)
  • 1DPP0AVO (DPA-A +5V Analog Voltage) drop to 0.0 +/- 0.3 V
  • 1DPICACU (DPA-A Input Current) drop to < 0.2 A (this value is noisy, so take an average)
  • DPA-A POWER should go to zero
  • 1DP28AVO (DPA-A +28V Input Voltage) is expected to have a small uptick, ~0.5 V, consistent with the load suddenly dropping to zero
  • The software and hardware bilevels will likely not have normal values if BEP side A is active.

All other hardware telemetry should be nominal. The current values for these can be found on our Real-Time Telemetry pages. Older data can be examined from the dump files or the engineering archive.
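
The checks above lend themselves to a quick screening script. The following is a minimal sketch, assuming the relevant MSID samples spanning the candidate major frame have already been extracted into NumPy arrays (from a dump file or the engineering archive); the function name and the uptick threshold are illustrative only:

    import numpy as np

    def dpa_a_shutdown_signature(dppsa, dpp0avo, dpicacu, dp28avo_before, dp28avo_after):
        """Return True if the samples match the DPA-A shutdown signature.

        dppsa           -- 1DPPSA samples (1 = On, 0 = Off)
        dpp0avo         -- 1DPP0AVO samples after the event, volts
        dpicacu         -- 1DPICACU samples after the event, amps (noisy, so averaged)
        dp28avo_before  -- 1DP28AVO samples before the event, volts
        dp28avo_after   -- 1DP28AVO samples after the event, volts
        """
        switched_off = dppsa[0] == 1 and dppsa[-1] == 0        # On -> Off within the frame
        analog_zero  = abs(dpp0avo[-1]) <= 0.3                 # +5V analog at 0.0 +/- 0.3 V
        current_zero = np.mean(dpicacu) < 0.2                  # averaged input current < 0.2 A
        bus_uptick   = np.mean(dp28avo_after) - np.mean(dp28avo_before) > 0.2   # ~0.5 V expected
        return switched_off and analog_zero and current_zero and bus_uptick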

What is the response?

Our real-time web pages will alert us and the Lead System Engineer will call us. We need to:

  • Send an email to the ACIS team at the official anomaly email address. If it is off-hours, call Peter and Bob.
  • Send an email to sot_red_alert announcing that the ACIS team is aware of the DPA-A shutdown and is investigating, and that a telecon will be called when more information is available.
  • Contact the GOT Duty Officer to inform them that we need the dump data as soon as possible, and ask them to email or call us when the dump file is available.
  • Process the dump data and make sure that there is nothing anomalous in the data BEFORE the shutdown. We want to know if a new occurrence looks just like the previous occurrences. If yes, it should appear as if in one frame the DPA-A turned off.
  • Once analysis of the dump data is complete, convene a telecon at the next reasonable moment with the ACIS team and review the diagnosis. The MIT ACIS team (Peter Ford, Bob Goeke, Mark Bautz, and Bev LaMarr) should also be included in the discussion, either in the telecon or via email. If the diagnosis is consistent with previous DPA-A anomalies, proceed with the recovery. If the diagnosis is not consistent with previous DPA-A anomalies, stop and start a more involved analysis with the ACIS team.
  • As soon as the time of the shutdown is known, inform sot_yellow_alert.
  • Identify whether or not additional comm time is needed and if so ask the OC/LSE to request it.
  • Send an email to sot_red_alert and call a telecon with the FOT, SOT, and FDs to brief them on the diagnosis and the plan to develop a CAP to recover.
  • Prepare a CAP and submit it for review to capreview AT ipa DOT harvard DOT edu, and cc: acisdude. It will also be necessary to call the OC/CC to determine which number should be used for the CAP. This CAP will have the following steps:
  • Execute the CAP at the next available comm. Reloading the flight software patches can take half an hour, so ensure that there is enough time in the comm to execute the entire procedure.
  • Write a shift report and distribute to sot_shift to inform the project that ACIS is restored to its default configuration.

Note

As of this writing, the latest ACIS Flight Software Patch is F, Optional Patch G. Before preparing the CAP, the latest version of the procedure should be checked.

Impacts

  • Until the DPA-A is powered back on, science operations will be interrupted.
  • The warmboot of the BEP will reset the parameters of the TXINGS patch to their defaults. They can be updated in the weekly load through a SAR.
  • After recovery from a DPA-A shutdown, the power status may be in an unusual state (e.g., lower than expected input current) due to FEPs being off. This situation should resolve itself with the next observation.

Relevant Procedures

FOT Procedures

  • SOP_ACIS_DPAA_ON (PDF) (DOC)
  • SOP_ACIS_SW_STDFOPTG (PDF) (DOC)
  • SOT_SI_SET_ACIS_FP_TEMP_TO_M121C (PDF) (DOC)

CLD Scripts

CAPs

  • CAP 1407 (DPA-A Poweroff Recovery) (PDF) (DOC)
  • CAP 1342 (DPA-A Poweroff Recovery) (PDF) (DOC)
  • CAP 818 (DPA-A Side Recovery from Enabled/Powered Off State) (PDF)

DPA-B Anomalous Shutdown

What is it?

The DPA-B shuts down anomalously, presumably due to a spurious command.

Note

The diagnosis and response to this anomaly presented in this document assumes that the active BEP is powered by DPA side A.

When did it happen before?

The DPA-B has shut down once:

  • December 13, 2007: 2007:347:17:50, obsid 58072

Will it happen again?

It appears likely that the anomaly will occur again if the mission continues.

How is this anomaly diagnosed?

Within a major frame (32.2 seconds), one should see:

  • 1DPPSB (DPA-B Power Supply On/Off) change from 1 to 0 (On to Off)
  • 1DPP0BVO (DPA-B +5V Analog Voltage) drop to 0.0 +/- 0.3 V
  • 1DPICBCU (DPA-B Input Current) drop to < 0.2 A (this value is noisy, so take an average)
  • DPA-B POWER should go to zero
  • 1DP28BVO (DPA-B +28V Input Voltage) is expected to have a small uptick, ~0.5 V, consistent with the load suddenly dropping to zero
  • If 1STAT2ST = 0, the DPA-B shutdown has caused a watchdog reboot of the BEP in use. This will happen if the DPA-B shuts down during an observation which is using B-side FEPs.

All other hardware telemetry should be nominal. The current values for these can be found on our Real-Time Telemetry pages. Older data can be examined from the dump files or the engineering archive.
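
As a rough illustration of how the B-side signature differs from the A-side one, the sketch below also folds in the 1STAT2ST check, since that bilevel distinguishes a simple DPA-B shutdown from one that has also forced a watchdog reboot of the active BEP. Inputs are assumed to be single post-event sample values; the function name is hypothetical:

    def classify_dpa_b_event(dppsb, dpp0bvo, dpicbcu_mean, stat2st):
        """dppsb: 1DPPSB (1 = On, 0 = Off); dpp0bvo: 1DPP0BVO in volts;
        dpicbcu_mean: averaged 1DPICBCU in amps; stat2st: 1STAT2ST bilevel."""
        dpa_b_off = dppsb == 0 and abs(dpp0bvo) <= 0.3 and dpicbcu_mean < 0.2
        if not dpa_b_off:
            return "no DPA-B shutdown signature"
        if stat2st == 0:
            # B-side FEPs were in use, so the shutdown also caused a watchdog
            # reboot of the BEP; the recovery will need to warm boot the BEP.
            return "DPA-B shutdown with BEP watchdog reboot"
        return "DPA-B shutdown only (active BEP unaffected)"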

What is the response?

Our real-time web pages will alert us and the Lead System Engineer will call us. We need to:

  • Send an email to the ACIS team at the official anomaly email address. If it is off-hours, call Peter and Bob.
  • Send an email to sot_red_alert announcing that the ACIS team is aware of the DPA-B shutdown and is investigating, and that a telecon will be called when more information is available.
  • Contact the GOT Duty Officer to inform them that we need the dump data as soon as possible, and ask them to email or call us when the dump file is available.
  • Process the dump data and make sure that there is nothing anomalous in the data BEFORE the shutdown. We want to know if a new occurrence looks just like the single previous occurrence. If yes, it should appear as if in one frame the DPA-B turned off.
  • Once analysis of the dump data is complete, convene a telecon at the next reasonable moment with the ACIS team and review the diagnosis. The MIT ACIS team (Peter Ford, Bob Goeke, Mark Bautz, and Bev LaMarr) should also be included in the discussion, either in the telecon or via email. If the diagnosis is consistent with previous DPA-B anomalies, proceed with the recovery. If the diagnosis is not consistent with previous DPA-B anomalies, stop and start a more involved analysis with the ACIS team.
  • As soon as the time of the DPA-B shutdown is known, inform sot_yellow_alert.
  • Identify whether or not additional comm time is needed and if so ask the OC/LSE to request it.
  • Send an email to sot_red_alert and call a telecon with the FOT, SOT, and FDs to brief them on the diagnosis and the plan to develop a CAP to recover.
  • Prepare a CAP and submit it for review to capreview AT ipa DOT harvard DOT edu, and cc: acisdude. It will also be necessary to call the OC/CC to determine which number should be used for the CAP. The steps in the CAP will depend on whether or not the active BEP has executed a watchdog reboot (see the sketch following the note below). This may happen if the shutdown occurs during an observation that utilizes the side B FEPs (side B powers FEPs 3-5), or if a subsequent observation requests them. Note that this implies that a watchdog reboot of the BEP is avoided only if the shutdown occurs during an observation using only 1 or 2 CCDs, and only until a later observation uses 3 or more CCDs.
    1. If the BEP has not executed a watchdog reboot, the steps should be:
      • Turn on DPA side B (SOP_ACIS_DPAB_ON, this can be executed even if a science run is currently in progress, see note below).
    2. If the BEP has executed a watchdog reboot, the steps should be:
  • Execute the CAP at the next available comm.
  • Write a shift report and distribute to sot_shift to inform the project that ACIS is restored to its default configuration.

Note

At this point in the mission (Jan 2017), it is standard practice to power off unused FEPs to reduce power consumption and keep the electronics temperatures lower. For this reason, it is believed that it is safe to power on the DPA side B during a science run that does not use these FEPs. If this practice is changed later in the mission, this procedure may have to be revisited.
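
To make the branch logic above concrete, here is an illustrative sketch of the decision, using only the fact quoted above that DPA side B powers FEPs 3-5; the FEP sets passed in are placeholders for whatever the interrupted and upcoming SI modes actually use:

    SIDE_B_FEPS = {3, 4, 5}

    def warm_boot_branch_needed(feps_in_use_at_shutdown, feps_in_upcoming_obs):
        """True if the BEP watchdog-reboot branch of the CAP applies: B-side FEPs
        were in use when DPA-B shut down, or a later observation will request them
        before the recovery is executed."""
        already = bool(SIDE_B_FEPS & set(feps_in_use_at_shutdown))
        later   = bool(SIDE_B_FEPS & set(feps_in_upcoming_obs))
        return already or later

    # Example: a 2-CCD observation on FEPs 0 and 1, followed by a 5-CCD observation.
    print(warm_boot_branch_needed({0, 1}, {0, 1, 2, 3, 4}))   # True -> warm-boot branch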

Impacts

  • Until the DPA-B is powered back on, science operations which require the use of the side B FEPs will be affected.
  • If it is necessary to warm boot the BEP, this will reset the parameters of the TXINGS patch to their defaults. They can be updated in the weekly load through a SAR.

Relevant Procedures

FOT Procedures

  • SOP_ACIS_DPAB_ON (PDF) (DOC)
  • SOP_61021_STANDBY (PDF) (DOC)
  • SOP_ACIS_WARMBOOT_DEAHOUSEKEEPING (PDF) (DOC)

CAPs

  • CAP 1055 (Commanding to Turn On DPA Side B and Warm Boot BEP Side A) (PDF) (DOC)

DEA-A Anomalous Shutdown

What is it?

The DEA-A shuts down anomalously, presumably due to a spurious command.

When did it happen before?

The DEA-A has shut down only once in the mission, in 2005:

  • September 15, 2005, 2005:258:23:31:29, obsid 6221

Will it happen again?

It appears it could happen again, but one occurrence in 16 years provides little guidance.

How is this anomaly diagnosed?

Within a major frame (32.2 seconds), one should see:

  • 1DEPSA (DEA-A Power Supply On/Off) change from 1 to 0
  • All DEA-A Analog Voltages (1DEP3AVO, 1DEP2AVO, 1DEP1AVO, 1DEP0AVO, 1DEN0AVO, 1DEN1AVO) go to 0.0 +/- 0.5 V
  • 1DEICACU (DEA-A Input Current) drop to < 0.2 A (this value is noisy, so take an average)
  • DEA-A POWER should go to zero
  • 1DE28AVO (DEA-A +28V Input Voltage) is expected to have a small uptick, ~0.5 V, consistent with the load suddenly dropping to zero

All other hardware telemetry should be nominal. The current values for these can be found on our Real-Time Telemetry pages. Older data can be examined from the dump files or the engineering archive.
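
The same kind of quick screening sketched for the DPA-A case applies here, adapted to the DEA-A MSIDs; inputs are assumed to be post-event values and the function name is illustrative:

    DEA_A_ANALOG_MSIDS = ["1DEP3AVO", "1DEP2AVO", "1DEP1AVO",
                          "1DEP0AVO", "1DEN0AVO", "1DEN1AVO"]

    def dea_a_shutdown_signature(depsa, analog_volts, deicacu_mean):
        """depsa: 1DEPSA (1 = On, 0 = Off); analog_volts: dict of MSID -> volts;
        deicacu_mean: averaged 1DEICACU in amps."""
        voltages_zero = all(abs(analog_volts[m]) <= 0.5 for m in DEA_A_ANALOG_MSIDS)
        return depsa == 0 and voltages_zero and deicacu_mean < 0.2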

What is the first response?

Our real-time web pages will alert us and the Lead System Engineer will call us. We need to:

  • Send an email to the ACIS team (including Peter Ford, Bob Goeke, Mark Bautz, and Bev LaMarr)
  • Process the dump data and make sure that there is nothing anomalous in the data BEFORE the shutdown. We want to know if a new occurrence looks just like the previous occurrences. If yes, it should appear as if in one frame the DEA-A turned off.
  • Convene a telecon at the next reasonable moment. A DEA-A recovery requires that the BEP be warmbooted after the DEA-A is back on (see the SOP and Flight Note 572 for details).

Impacts

  • Until the DEA is powered back on, science operations will be interrupted.
  • After the DEA is powered back on, the focal plane temperature will be unregulated and possibly uncalibrated. Recovery will require a BEP warmboot and setting the focal plane temperature to -121 °C.
  • The warmboot of the BEP will reset the parameters of the TXINGS patch to their defaults. They can be updated in the weekly load through a SAR.

DEA Sequencer Reset During a Science Observation

What is it?

The DEA sequencer crashes, resulting in loss of science data from all video boards for the rest of the science run.

When did it happen before?

The DEA Sequencer has reset three times during the mission; most recently in 2013:

  • June 22, 2004: 2004:174:20:49, obsid 5008
  • February 13, 2009: 2009:044:00:14, obsid 10275
  • July 30, 2013: 2013:211:05:42, obsid 15474

Will it happen again?

It appears likely that the anomaly will occur again.

How is this anomaly diagnosed?

  • The science run will be terminated, sending a scienceReport packet that will contain non-zero fepErrorCodes and a non-zero terminationCode.
  • Within a major frame (32.2 seconds), there will be a slight drop, ~0.2-0.3 A, in DPA Input Current A and/or B (1DPICACU and/or 1DPICBCU). Whether the drop appears in one or both input currents depends on which side is running which FEPs.
  • Within a major frame, the 1STAT1ST bilevel will toggle off (science idle).
  • DEA Housekeeping will show sharp drops in the temperatures of FEP 0 and/or FEP 1, if they are running.
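
A minimal screening sketch for this signature is shown below. It assumes the scienceReport fields have already been decommutated into a Python dict and that the current drop and the 1STAT1ST value have been read off separately; the dict layout and function name are assumptions for illustration:

    def looks_like_dea_sequencer_reset(science_report, dpa_current_drop_amps, stat1st):
        """science_report: dict with 'fepErrorCodes' (list) and 'terminationCode'."""
        fep_errors = any(code != 0 for code in science_report.get("fepErrorCodes", []))
        terminated = science_report.get("terminationCode", 0) != 0
        current_ok = 0.1 <= dpa_current_drop_amps <= 0.5   # expect ~0.2-0.3 A on side A and/or B
        idle       = stat1st == 0                          # 1STAT1ST toggled off (science idle)
        return fep_errors and terminated and current_ok and idle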

What is the first response?

Most likely we will be notified by CXCDS Ops that data ceased prematurely for an observation. We need to:

  • Send an e-mail to the ACIS team (including Peter Ford, Bob Goeke, Mark Bautz, and Bev LaMarr)
  • Process the dump data and get access to the CXC products
  • Convene a telecon at the next reasonable moment.
  • Examine data from the next observation because the setup for the next observation should clear the problem.

Impacts

  • The last portion of the science run will be lost. The following science run should be unaffected.

Relevant Notes/Memos

FEP Anomalous Reset

What is it?

One or more (but not all) FEPs can reset during an observation, resulting in loss of science data from the affected FEPs for the rest of the science run.

The unaffected FEPs will continue to operate normally. This is different from the DEA Sequencer Reset During a Science Observation anomaly, which shuts down all FEPs.

When did it happen before?

FEP resets have occurred three times during the mission; most recently in 2013:

  • April 6, 2007, 2007:096:19:24, obsid: 7647
  • March 10, 2008, 2008:070:15:31, obsid: 7783
  • May 16, 2013, 2013:136:19:57, obsid: 15232

Will it happen again?

It appears likely it could happen again. Side B of the DPA seems particularly susceptible to power transients.

How is this anomaly diagnosed?

If all of the FEPs were halted, this could be an instance of DEA Sequencer Reset During a Science Observation. However, if some FEPs were halted but others continued, then the resets were most likely caused by a momentary “glitch” in the DPA-B +5V supply which halted the CPUs that were currently powered by that supply.

To be certain:

  1. Obtain the relevant dump data file from the directory /dsops/critical/GOT/input/, which is located on the Ops LAN.

  2. Run ACORN on the file as per the instructions in “Running ACORN on data dumps in the case of an anomaly (04/06/16)”

  3. Run the MIT decom on the data dump as per the instructions in “Memo on Running MIT tools (04/26/16)”

  4. Look for the line that gives the time of the last exposure packet from the offending FEPs. It will look like this:

    136:19:57:18 - Last exposure packets from FEPs 3 and 4, exposureNumber=11324

  5. Look for when the FEPREC_RESET occurred. The line will look like this:

    136:19:57:59 - SwHousekeeping FEPREC_RESET reported, count=2, value=4

    If the time between those two events is less than 449 seconds, then the reset was not watchdog timer-induced.

  6. Look at the following MSIDs:

    • DPA5VHKB
    • 1DPICBCU
    • 1DPP0BVO
    • 1DPICACU

    Typically, DPA5VHKB bounces around by +/- a volt. If, however, it steadies up right around the time of the FEP halt, this indicates that all DPA-B boards were in a reset state.

    Check to see if the DPA-B current (1DPICBCU) dropped at around the same time while the DPA-B voltage (1DPP0BVO) and the DPA-A current and voltage remained steady.

The above behavior confirms that the FEPs actually reset.
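
The timing comparison in step 5 is simple enough to script. The sketch below parses the DDD:HH:MM:SS timestamps in the MIT decom output quoted above and applies the 449-second criterion; it assumes both timestamps fall in the same year:

    def seconds_between(stamp_a, stamp_b):
        """Difference in seconds between two DDD:HH:MM:SS day-of-year timestamps."""
        def to_seconds(stamp):
            ddd, hh, mm, ss = (int(x) for x in stamp.split(":"))
            return ((ddd * 24 + hh) * 60 + mm) * 60 + ss
        return to_seconds(stamp_b) - to_seconds(stamp_a)

    # Values from the 2013 event quoted in steps 4 and 5 (the gap is 41 seconds).
    gap = seconds_between("136:19:57:18", "136:19:57:59")
    if gap < 449:
        print(f"{gap} s between last exposure and FEPREC_RESET: "
              "not a watchdog timer-induced reset")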

What is the first response?

Most likely we will be notified by CXCDS Ops that data from one or more of the CCDs stopped during an observation. We need to:

  • Send an email to the ACIS team (including Peter Ford, Bob Goeke, Mark Bautz, and Bev LaMarr)
  • Process the dump data and get access to the CXC products
  • Convene a telecon at the next reasonable moment.

Impacts

Usually the power-down prior to the next observation clears the anomaly. If the target is not on one of the halted FEPs, then it is likely that the science objectives of the observation will be met.

We should examine data from the next observation because power-cycling the FEPs should clear the condition. But if the next observation uses the same configuration, the FEPs will not be power cycled and the anomaly will persist.

Hi/Lo Pixel Anomaly

What is it?

Event data stops being reported for one CCD/FEP combination, and the delta overclock values are peculiar: large and negative for the first three output nodes and large and positive for the fourth output node.

When did it happen before?

Twice:

  • March 7, 2011: obsid 12934
  • October 31, 2013: obsid 16496

Will it happen again?

It appears likely it could happen again.

How is this anomaly diagnosed?

Both of the following symptoms will be noticed:

  • One or more FEPs will stop returning event data.
  • The deltaOverclock values reported from these FEPs are large and negative for the first three output nodes and large and positive for the fourth output node.

Both of these symptoms can be observed from one of the PMON pages.
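
For reference, the deltaOverclock pattern can be expressed as a simple check; the four values would be read off a PMON page for the suspect FEP, and the threshold used here for 'large' is an assumed placeholder:

    def hi_lo_pixel_signature(delta_overclocks, large=100):
        """delta_overclocks: the four per-output-node deltaOverclock values for one FEP."""
        if len(delta_overclocks) != 4:
            raise ValueError("expected one value per output node (4 nodes)")
        first_three_negative = all(v < -large for v in delta_overclocks[:3])
        fourth_positive = delta_overclocks[3] > large
        return first_three_negative and fourth_positive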

What is the first response?

Most likely we will be notified by CXCDS Ops that data ceased prematurely for a single CCD during an observation. There is, however, a chance that we can catch this anomaly in a real-time contact during a long observation. If so, we have an SOP for intervening while the affected observation is still in progress in order to dump diagnostic data.

If not, we need to:

  • Send an email to the ACIS team (including Peter Ford, Bob Goeke, Mark Bautz, and Bev LaMarr)
  • Process the dump data and get access to the CXC products
  • Convene a telecon at the next reasonable moment.
  • Examine data from the next observation, because the setup for the next observation should clear the problem.

Impacts

  • The last portion of the science run for that particular CCD/FEP combination will be lost. The following science run should be unaffected.

Relevant Procedures

Trickle Bias / T-Plane Latchup Anomaly

What is it?

Science event telemetry begins before the bias map has been telemetered, overloading direct memory transfers from one or more FEPs and resulting in locked values for the FEP Threshold Plane.

When did it happen before?

Three times:

  • June 27, 2000: obsid 371
  • October 29, 2001: obsid 3403
  • November 4, 2001: obsid 2010

Will it happen again?

Most likely not. The untricklebias patch, installed in Flight Software Version B-opt-C in 2003, calls all biasThief methods from within the science thread, preventing simultaneous telemetry of bias and event packets.

How is this anomaly diagnosed?

There are two symptoms: interleaved bias and event packets, and frozen threshold crossing values. The most immediately obvious symptom is saturated telemetry, with far too many events from the affected FEP or FEPs. On an affected FEP, the threshold crossing counts will not change from one frame to the next. The symptoms disappear with the next science run, and may become apparent only when SSR data are examined.
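
The frozen threshold-crossing symptom can likewise be screened mechanically; the sketch below assumes the per-frame threshold-crossing counts for one FEP have been collected into a list from the decommutated telemetry, and the frame window is an arbitrary choice:

    def threshold_counts_frozen(counts_per_frame, window=5):
        """True if the last `window` per-frame threshold-crossing counts are identical."""
        recent = counts_per_frame[-window:]
        return len(recent) == window and len(set(recent)) == 1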

What is the first response?

In 2001, the response was to recycle FEP power. This is now automatically built into the start of each SI mode command sequence, so a manual power recycle is no longer needed. The subsequent science run in the load will execute normally, so no corrective action is necessary.

If we see a T-Plane latchup upon coming into comm, it may be worth trying to salvage the remainder of the science run. Check whether a CLD exists for a WSPOWXXXXX packet to command the required set of DEA boards and FEPs. If so, and significant time remains for the obsid, execute stop science, WSFEPALLDN, the WSPOWXXXXX command, and start science.

A special case: if the following science run is a no-bias version of the same SI mode, it will not recycle the FEP power. Send a stop science and a WSFEPALLDN command to ACIS.

Afterward, the team may examine the telemetry stream at leisure to see what may have triggered the latchup. If there was no interleaving of bias and events, look for any other simultaneous high demand on DMA transfers out of the affected FEPs.

Impacts

Once the T-Plane latches, science will be lost from the latched FEPs for the rest of the science run, and some science is likely to be lost from remaining FEPs due to telemetry saturation.

Relevant Procedures

Command Files

Currently there are two 6-chip and three 5-chip power commands in CLDs. The 5-chip commands all power an additional, unneeded FEP.