Near-Real-Time Hyperion Analysis

A cloud-based monitor for consolidation heuristics

Houston Haynes

16 minute read

tl;dr

The term “performance” can mean different things within a given corporate context. In financial firms it most often refers to profit or loss. In tech companies, it can refer to the throughput of an application or service. In this case, the term “performance” shares both of those connotations. The company was upgrading their Oracle Hyperion Financial Management installation, which is used to measure the company’s financial performance (and supports other tasks, such as forecasting). And because of the global nature of the company, and its reliance on detailed ledger data ETL’d in from remote manufacturing locations, there was also a system-level performance aspect to the assignment. Due to the critical nature of close period reporting, there was a high degree of sensitivity around timely submission of data. This entry in my portfolio will focus on that latter portion of the “performance” equation, though the implications for the former will become clear as this entry progresses.

Beginning With The End In Mind

There are other visualizations based on this schema, which was built in PowerBI, and they are demonstrated throughout this portfolio entry. Another useful “end point” to consider is the schema diagram shown here. As I will detail later, the Concurrency table is the hub, as it has a row for every “grain” in the temporal span. From that, every time-based table has a field with a many-to-one relationship mapped to that central table. As visualizations are presented, the way that hub maintains consistent time-scoping between the various elements of a given chart will become apparent.

In the following pages, aside from the technical walkthrough, I will also detail some of the business context and justifications for the process as a whole. While it would certainly help to have familiarity with Hyperion’s lexicon, it’s not a requirement. This should not be considered a primer on Accelatis or Hyperion. Terms from those applications will be used in passing, with the focus on how the data is collected and visualized. My goal is to show how meaningful business process information can be gleaned from low-level data that the vendors themselves have not recognized. So, I’ve made an effort to keep the language as broad as possible.

Accelatis Pitfalls

In order to understand this solution, it’s useful to see some key points of the Accelatis Performance Management Platform. This section of the document details how I first attempted to stay with “native” data and visualizations, and the efforts that followed from encountering its limitations. One of the major stumbling blocks with Accelatis is the lack of a consolidated view of activity. Charts for system performance and charts for user activity were in separate areas of the application. Similarly, consolidation views were in yet a third area of the application. Any question about this led to the Accelatis consultants steering me toward extracting data out of the application and re-charting it in Excel.

Challenge 1: CSV Extracts from Accelatis

I started moving data out of Accelatis for separate analysis by using CSV files exported from the Accelatis Performance Lab and Activity Analysis Grid View. However, manually extracting the data and then creating all of the necessary charts and pivots was both time-consuming and limiting.

Aside from the inability to see “rolling reports” as was envisioned when the company licensed the tool, the CSV files were missing columns of data that I was looking to chart in the reports. Due to those factors, as well as general limitations in the Accelatis CSV data (such as missing agent location information), I came up with my own method of direct SQL extraction to present findings in the deliverables.

Challenge 2: Direct Query of Accelatis Data

As soon as I stepped away from the Accelatis application context to query the data directly from the database, other ‘wrinkles’ were found in the data. On the plus side, we had access to agent fields in the “AUXDATA” column of the Opt_ActivityPerformance table. These were the actual agent machine names, which gave us a way to correlate each activity with its geographic location. This was not available to us in the Grid View of Accelatis’ Performance Lab.

On the negative side, we found that each agent reported the time stamp of their activity according to the local time zone of that agent. So, a time stamp from the agent in Shenzhen could be 15 or 16 hours “ahead” of the time zone in El Segundo (depending on whether daylight saving time was in effect on the west coast). This led to a two-step process for deriving the “window” of time for each location.

  1. Use the date of the “local start time” to calculate the number of “Adjustment” hours for each location
  2. Calculate the Start and End time for each location based on the derived Adjustment value

Calculating the “Adjustment” value is non-trivial, as Daylight Saving Time applies (where it applies) to different regions of the globe over different date ranges. So aside from “pulling everything in” to match US local time during DST, there were other regions (such as Mexico and Europe) that had their own adjustments, gated by different calendar dates. After all of that was sorted, I created a single step in the SSIS process to populate the “Adjustment” column. From there, the “Start” and “End” DateTime values are calculated for each location, again within a single iterative SSIS task. A sketch of how such an adjustment can be derived follows below.
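The derivation itself becomes tractable once each location is tagged with a time zone. Here is a minimal C# sketch of how the value could be computed inside an SSIS script task; the time-zone id lookup and the names used here are illustrative assumptions, not columns or variables from the actual schema.

using System;

// A minimal sketch (not the production SSIS step) of deriving the "Adjustment"
// value for one location. The Windows time-zone id is an assumed lookup value.
static class AdjustmentSketch
{
    // Hours to subtract from an agent's local timestamp to align it with the
    // El Segundo (Pacific) server clock on the date in question.
    public static double HoursToSubtract(string agentTimeZoneId, DateTime localStartTime)
    {
        TimeZoneInfo agentZone  = TimeZoneInfo.FindSystemTimeZoneById(agentTimeZoneId);
        TimeZoneInfo serverZone = TimeZoneInfo.FindSystemTimeZoneById("Pacific Standard Time");

        // Convert the agent's wall-clock time to UTC, then compare offsets so that
        // each side's own Daylight Saving rules (US, Europe, Mexico, or none) apply.
        DateTime utc = TimeZoneInfo.ConvertTimeToUtc(
            DateTime.SpecifyKind(localStartTime, DateTimeKind.Unspecified), agentZone);

        TimeSpan agentOffset  = agentZone.GetUtcOffset(utc);
        TimeSpan serverOffset = serverZone.GetUtcOffset(utc);
        return (agentOffset - serverOffset).TotalHours;  // e.g. Shenzhen vs. Pacific DST => 15
    }
}

From that value, the per-location “Start” and “End” window boundaries in step 2 are simple date arithmetic.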

The fully populated table (after the SSIS setup processing) contains the relative time periods for each agent location. The data in this table is used by downstream SSIS objects to pull Accelatis data for the time period that’s “correct” for each respective location. After that initial data is pulled, a separate “Adjusted DateTime” column is populated, again using the “Adjustment” value to subtract the necessary hours and insert the values in the new column. This ensures two things: 1) the time values for each location are sortable by local time, and 2) the process can be used to “reach back” into the historical record of the database to retrieve records for any period and any location. That allows for follow-up analysis runs whenever the working group wants other, more detailed findings from historical data.

Challenge 3: Accelatis time stamps (.NET and Unix issues)

Once I had extracted the local agent data, I noticed that the time stamp string would present a problem. Even though we were in an Oracle solution stack, the Accelatis tooling was built on a .NET code base. This meant that the timestamps were serial values carrying the now-infamous “leap year bug” that stemmed from Excel preserving an error in Lotus 1-2-3, which treated 1900 as a leap year. So instead of adding the serial value to 1-Jan-1900, I had to use 30-Dec-1899 as the epoch (the OLE Automation base date that .NET uses to compensate for that phantom leap day).

SELECT DISTINCT Substr(accelatis.metricsamples.servername, 7, 6) AS Server, 
                To_date('12/30/1899', 'MM/DD/YYYY') 
                + accelatis.metricsamples.datetime               AS DATETIME, 
                accelatis.metricsamples.metricvalue              AS CPU 
FROM   accelatis.metricsamples 
       join accelatis.metricdefs 
         ON accelatis.metricdefs.metricdefid = 
            accelatis.metricsamples.metricdefid 
WHERE  To_date('12/30/1899', 'MM/DD/YYYY') 
       + datetime >= To_date('3/6/2016 11:12:00 AM', 'MM/DD/YYYY HH:MI:SS AM') 
       AND To_date('12/30/1899', 'MM/DD/YYYY') 
           + datetime <= To_date('3/6/2016 1:12:00 PM', 'MM/DD/YYYY HH:MI:SS AM' 
                         ) 
       AND accelatis.metricdefs.metricname LIKE '%Processor Time' 
ORDER  BY datetime ASC
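For what it’s worth, the '12/30/1899' literal in the query is the same base date .NET uses for OLE Automation serial dates, so the equivalent check in C# is a one-liner. The serial value below is illustrative only, chosen to line up with the query’s lower bound rather than taken from the data.

using System;

// The OA ("OLE Automation") epoch is 30-Dec-1899, matching the To_date literal above.
double rawSerial = 42435.4666666667;              // illustrative value, not pulled from the data
DateTime stamp = DateTime.FromOADate(rawSerial);  // same as new DateTime(1899, 12, 30).AddDays(rawSerial)
Console.WriteLine(stamp);                         // ~ 3/6/2016 11:12:00 AM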

To make the situation even more confusing, a bug was discovered in the timestamp code that Accelatis used to monitor Unix servers. The net result was that database CPU values were again mis-counted by one day after February 29, 2016. So I had to split off the database CPU extraction, because I then had to decide whether to use 30-Dec or 31-Dec as the epoch day and month in that calculation. This also meant that database CPU data overlapped on 29-February and 1-March, so data from that time period was nearly impossible to disambiguate. Needless to say, this was reported to Accelatis, and hopefully it will be resolved before Leap Year 2020.

Data Enrichment

Microsoft SQL Server Integration Services (SSIS) extracts data from both the Accelatis and HFM schemas, which reside on the same Oracle database. It then enriches and moves the data into an ersatz temporal data warehouse. The term “ersatz” is used because there are no relationships between the tables as they stand within the store. As I showed in the first page of this entry, the relationship between the concurrency table and the other time-based tables is created within PowerBI, as that affects the behavior of drill-downs and other forms of filtering.

Staging tables – Agent data

For each agent there is a representative staging table in the DW. That table holds data that’s specific to that agent’s activity during the activity window.

SELECT auxdata    AS AGENT, 
       activity, 
       To_date('12/30/1899', 'MM/DD/YYYY') 
       + datetime AS ORIGIN_DATETIME, 
       duration, 
       concurrency, 
       resultstring 
FROM   accelatis.opt_activityperformance 
WHERE  auxdata = [machine name] 
       AND To_date('12/30/1899', 'MM/DD/YYYY') 
           + datetime >= To_date([location startdatetime], 
                         'MM/DD/YYYY HH:MI:SS AM') 
       AND To_date(datestring, 'MM/DD/YYYY HH:MI:SS AM') <= 
           To_date([location enddatetime], 'MM/DD/YYYY HH:MI:SS AM') 
       AND duration NOT IN ( 0 ) 

Note that there’s both an “ORIGIN_DATETIME” column, which is the local time as reported originally by the agent, and a separate “DATETIME” column. The latter is populated as part of the SSIS enrichment process, which subtracts the number of hours recorded for that location in the Agent fact table (as described previously) to give a compensated time matching the server time on the west coast. These agent-specific staging tables contain both successful activities and failures. Later in the SSIS process those are split off from each staging table into consolidated “ALLDATA” and “ERRORS” tables for reporting. (In the query above, the bracketed AUXDATA and date-time values are substituted as part of the SSIS process.)
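As a rough sketch of that enrichment step (the method and parameter names here are illustrative, not the actual SSIS variables), the adjusted value is just the agent’s reported time minus the location’s Adjustment hours:

using System;

// Sketch of the enrichment described above: subtract the location's "Adjustment" hours
// from the agent-reported local time so every row sorts on one server-aligned DATETIME column.
static class EnrichmentSketch
{
    public static DateTime ToServerTime(DateTime originDateTime, double adjustmentHours)
    {
        // e.g. ToServerTime(shenzhenStamp, 15.0) lands the row on the El Segundo clock
        return originDateTime.AddHours(-adjustmentHours);
    }
}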

Data Tables – Consolidations

A primary driver of load on the HFM server is global consolidations. For that reason we made an effort to reconcile each consolidation event for concurrency, and captured the actual detail information as it occurred.

SELECT activitysessionid               AS SESSIONID, 
       To_date('12/30/1899', 'MM/DD/YYYY') 
       + starttime - interval '7' hour AS STARTTIME, 
       To_date('12/30/1899', 'MM/DD/YYYY') 
       + endtime - interval '7' hour   AS ENDTIME, 
       Substr(strdescription, 42)      AS DESCRIPTION 
FROM   hfm.prod_task_audit 
WHERE  strdescription LIKE '%Consolidation Completed%' 
       AND To_date('12/30/1899', 'MM/DD/YYYY') 
           + starttime - interval '7' hour >= 
           To_date('4/6/2016 11:12:00 AM', 'MM/DD/YYYY HH:MI:SS PM') 
       AND To_date('12/30/1899', 'MM/DD/YYYY') 
           + starttime - interval '7' hour <= 
           To_date('4/6/2016 1:12:00 PM', 'MM/DD/YYYY HH:MI:SS PM') 
ORDER  BY starttime ASC 

Note here that Consolidation times are recorded in GMT. So again, an on-the-fly conversion is required – one that shifts depending on whether the date period is during Daylight Saving Time.
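The interval '7' hour literal above reflects the Pacific offset during Daylight Saving Time. As a hedged sketch (an assumption about how the substitution could be driven, not the production package), the shift for any reporting window could be derived rather than hard-coded:

using System;

// Sketch only: derive the GMT-to-Pacific shift for the reporting window so the
// SSIS package can substitute 7 (during US DST) or 8 (otherwise) into the query above.
static class GmtShiftSketch
{
    public static int HoursBehindGmt(DateTime periodStartUtc)
    {
        TimeZoneInfo pacific = TimeZoneInfo.FindSystemTimeZoneById("Pacific Standard Time");
        DateTime utc = DateTime.SpecifyKind(periodStartUtc, DateTimeKind.Utc);
        return (int)(-pacific.GetUtcOffset(utc).TotalHours);  // 7 in DST, 8 in standard time
    }
}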

Data Tables – HFM and DB CPU

As shown on the previous page, another key metric is CPU level for the main application and database servers. In most cases all server values are queried for the given time period, and PowerBI is used to filter, sort and categorize the results.

Note that the “SERVER” column has a somewhat-human-readable “HFMxxx” indicator for each entry. This comes from another substring function applied to the actual server name as recorded in the database, shortened here to simplify the presentation of the CPU legend in PowerBI. (Like the previous examples, the highlighted datetime values are substituted as part of the SSIS process, read in from project-level parameters that are set before run-time.)

This is the portion of the process which has the most technical “inner workings”. Up to this point SSIS has been used mainly for parameter substitution to collect data through queries. This section will show how C# scripting is used to construct the concurrency table.

Here is an example of the final result with calculated values. From that, a stepped area chart can be plotted, which supports analysis of user activity against server demand. To produce a table with the running integer count at each second, we must first know when each event starts and completes. We already have this data collected in the LIST_CONS table, so the process here derives the counts from that table.

// March down the concurrency table one second at a time across the reporting period.
// CurrentRow, PeriodEnd, CONS_StartTime, CONS_EndTime, conn, comm, ConcurrencyValue
// and rowsAffected are all populated earlier in the script task.
for (; CurrentRow <= PeriodEnd; CurrentRow = CurrentRow.AddSeconds(1.0)) {
 // Only rows inside this consolidation's start/end window are touched.
 if ((CurrentRow >= CONS_StartTime) && (CurrentRow <= CONS_EndTime)) {
  string stamp = CurrentRow.ToString(@"MM-dd-yyyy hh:mm:ss tt", new CultureInfo("en-US"));

  // Read the current consolidation count for this one-second "grain"...
  comm = new SqlCommand("SELECT Consolidation FROM CONCURRENCY WHERE DateTime = '" + stamp + "'", conn);
  ConcurrencyValue = (int) comm.ExecuteScalar();

  // ...increment it and write it back to the same row.
  ConcurrencyValue++;
  comm = new SqlCommand("UPDATE CONCURRENCY SET Consolidation = " + ConcurrencyValue.ToString() + " WHERE DateTime = '" + stamp + "'", conn);
  rowsAffected = comm.ExecuteNonQuery();
 }
}

The for loop controls the “march down” through the values of the concurrency table. The check is made at each row of the table via the if statement. Only rows that match the criteria are updated, so only fields that correspond to the time range of the consolidation are “touched”. While it looks like a new command is issued against the connection on every iteration of the for loop, SSIS handles this elegantly and uses one connection for all calls during the task. From that point it’s simply a matter of charting those values in visualization tools such as PowerBI or HighCharts.js. The latter example can be seen at the top of this section, and PowerBI is used in the next chart.

Visualization, Part 1

In previous assignments I have implemented Analysis Services “cubes” to view and analyze data via SSRS or an Excel data source connection. In other cases, I’ve created a PowerQuery profile to be used in a shared document. PowerBI fits somewhere in the “Goldilocks Zone” between the two. Beyond Microsoft’s legacy products, those who are familiar with Tableau will also find many similarities.

Derivation of maximum reasonable response times for SLAs

One of the initial motivations for using a tool like PowerBI was to calculate statistical data and analyze response times for various task/location combinations without having to build an entire Analysis Services cube. Due to the variability of response times for tasks originating from HFM locations, we sought a means to view the data both in the aggregate and on a location-specific basis. I decided to use a 95th percentile measure as the “high water mark”. It seemed reasonable to view values beyond roughly the 2nd standard deviation (2σ, assuming a normal distribution) as outliers that might require further review.

As opposed to creating a calculated column of data, a DAX measure calculates “on the fly” based on the scope of data selected. The data can be constrained by time interval, task type and, as was the focus for our purposes, location.
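In the report itself this is a DAX measure, but conceptually it is just a percentile over whatever rows are currently in scope. A hedged, conceptual sketch of that calculation (in C#, using linear interpolation in the style of PERCENTILE.INC, and not the DAX actually used) looks like this:

using System;
using System.Collections.Generic;
using System.Linq;

// Conceptual sketch of the measure: the 95th-percentile duration for whatever slice
// of rows is in scope (time interval, task type, location). Illustrative only.
static class PercentileSketch
{
    public static double Percentile95(IEnumerable<double> durations)
    {
        double[] sorted = durations.OrderBy(d => d).ToArray();
        double rank = 0.95 * (sorted.Length - 1);      // linear interpolation between ranks
        int lo = (int)Math.Floor(rank);
        int hi = (int)Math.Ceiling(rank);
        return sorted[lo] + (rank - lo) * (sorted[hi] - sorted[lo]);
    }
}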

The chart above shows an example where the 95th percentile is calculated for each task type for all locations. The column presents the minimum, maximum and average using built-in PowerBI functions.

This is an example where PowerBI is used to analyze the duration for tasks on a location basis (with the lower chart showing duration for all other locations). The top chart focuses on data from the Penang location, while the summary table below has all data except for values from Penang.

Analysis – using PowerBI to review error incident rates by location and type

PowerBI was also used to troubleshoot error conditions during simulation runs, and would later be used as a tool for production support. Here we have pie charts that show errors by location and by task type. Because of the relationship between the various tables pulled from the data snapshot, selecting an area of a pie chart also auto-focuses the data set shown in all other charts on display. Below, if you select the “Amstelveen” portion of the pie chart, all other charts will follow that selection as an ersatz filter. To change all of those visual elements, simply select a different location or an error type from the other pie chart.

Note that in all cases the highlighting/lowlighting of pie chart areas maintains the overall view of data. Simply selecting an alternate pie chart area updates the other respective areas of the displayed information.

On a business note: in this case the 404 errors were caused by an EPM server using a different authority manager than the other machines in the tier, and occasionally failing to authenticate sessions/connections made from the web server. This chart was used to isolate whether the errors clustered by location, request type, or user. It turned out to be none of the above, and once all of the servers were pointing to the same grant authority the issue was resolved. Even so, the report was kept as a tool for future troubleshooting tasks.

Visualization, Full Circle

For the purposes of the graph below, the primary interest is viewing the interaction between user Activities, Consolidations and HFM server demand. One of the early issues found was a “sudden” inflation of response times for Activities while database usage was otherwise low. The first chart in the visualization below is an example of what users observed. Note that when Activity times are longest, the database level is near nominal.

It was difficult to pinpoint the core issue using only the standard Accelatis performance views, but here, where the graphs are all charted together, the cause becomes more transparent. While it seems unlikely that user activities would be delayed when database CPU usage is low, it can be seen here that the HFM server (the home of the “cube” that sits over the datastore) continues to process data, with locks applied to various intersections of data while that processing completes. Oracle does a good job of orchestrating and prioritizing these tasks, but there’s an inflection point, unrelated to database usage, where the queuing expands considerably. Without seeing this “confluence of events” the cause can be maddeningly elusive. With that “lurking observation” brought to the surface and correlated in the visualization, the relationship becomes obvious.

Similar to visualizing overall request latency based on server load, it’s also possible to see what’s occurring on a per-location basis with the same corpus of data. Here we have returned to the chart that served as the introduction to this portfolio entry. While there are only 7 Agents listed for this period, there are as many as 10 running at any given time. Beyond that, when users are active in the system their location can be reverse-mapped using the IP address of the requests they send. In all there are more than 50 discrete points that can be fixed to a location on this global map, and the chart can be further tailored to focus on a specific time period and Activity type.

In one instance this data helped to isolate location-specific network issues impacting systems well beyond Hyperion. As it happens, locations that are close to each other geographically can be on completely different network “backbones”. Shenzhen and Hong Kong are one such example: the former always performed well over time, while the latter was almost always an outlier, for all Activity types. It turned out that the issue wasn’t with the Hyperion configuration or the scope of consolidations or other requests performed, but with a network routing error at the Hong Kong location. Once that was resolved, not only did that location fall back within the range of response times seen by Shenzhen and other global locations, but overall network latency improved for all users at the Hong Kong office. So, in that manner, the detailed scrutiny of location-based client-side data led to an overall quality-of-work improvement for everyone in that office, not just the Hyperion users.

With this system in place, it is clear how much the measurement of financial performance can be affected by the performance of the Hyperion system itself. And while most of these data snapshots are user-triggered in nature, with the new real-time streaming analytics functionality in PowerBI it’s now possible to keep a ‘running tab’ on systems and user locations in near-real time. This would be the final component in matching (and in many cases exceeding) the functionality of a Hyperion/Accelatis implementation.
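As a closing sketch, PowerBI streaming (“push”) datasets accept rows over a simple HTTPS POST, so feeding concurrency and CPU readings in near-real time could look something like the following. The push URL, key and row fields are placeholders and assumptions, not values from this project.

using System;
using System.Globalization;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

// Hedged sketch of pushing one reading into a Power BI streaming (push) dataset.
// The push URL and row schema are placeholders that would come from the dataset's API settings.
static class StreamingPushSketch
{
    private static readonly HttpClient Client = new HttpClient();
    private const string PushUrl =
        "https://api.powerbi.com/beta/<tenant>/datasets/<dataset-id>/rows?key=<key>";

    public static async Task PushReadingAsync(DateTime stampUtc, string server, double cpu, int consolidations)
    {
        // The endpoint expects a JSON array of row objects matching the dataset's schema.
        string json = "[{"
            + "\"DateTime\":\"" + stampUtc.ToString("o") + "\","
            + "\"Server\":\"" + server + "\","
            + "\"CPU\":" + cpu.ToString(CultureInfo.InvariantCulture) + ","
            + "\"Consolidation\":" + consolidations.ToString(CultureInfo.InvariantCulture)
            + "}]";

        using (var content = new StringContent(json, Encoding.UTF8, "application/json"))
        {
            HttpResponseMessage response = await Client.PostAsync(PushUrl, content);
            response.EnsureSuccessStatusCode();
        }
    }
}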
