Texas Ranger

Previous article: AI Charts and Dashboards

In our previous posts we talked about the instrumentation of a single page app and how to visualize our app's behaviour on dashboards and charts, but so far we have been rather reactive: we observed and monitored our app and, in case of problems, drilled down into dashboards, charts and other AI tools. Wouldn't it be nice to identify potential APM problems earlier in the application lifecycle? There are various ways to do so.

Load and Stress Tests

Load and stress tests are great for finding performance problems before they reach production. Since they require a reasonably powerful system - one that is somewhat comparable to the production environment - they are usually done later in the application lifecycle, e.g. in a staging environment. Anyone who has had the pleasure of taking a non-trivial application from development through staging to production will probably recognize the following scenario:

"Burndown Chart"



The burndown chart looks fine during development, unit and functional testing. But as we finish our load tests on the integration environment, issues keep popping up: unexpected delays due to network latencies (we now have more physical tiers or services than during development), congestion on certain services, memory problems due to the large session state of many concurrent users, CPU overhead because of constant conversions of objects between layers and services, inefficient database access strategies, concurrency and locking issues, garbage collector overhead due to caching frameworks, and many more.

It is in the nature of all these problems that they are hard to identify earlier if we only look at performance KPIs like response time; they are revealed only in complex environments with many concurrent real users. And if we identify them late, they tend to become very expensive: the best architects have to work in task forces to solve them, the backlog gets frozen, many load tests are run to narrow down the root cause, and reputation might be lost in the market. So our goal is clearly to catch as many of these problems as early as possible. But how do we do that if they show up only in complex scenarios? The good news is that we do not have to wait until problems result in bad response times.

Static Code Analysis

Static code analysis identifies bad code (design and architecture), but as the name suggests, it is static. By definition it cannot reveal problems resulting from dynamic dependencies, because that would require simulating the program flow. As an example, let's take a data access service which returns domain objects:

class DataServiceServer {
    private dataRepo: DataRepo; // Data repository, initialized elsewhere

    getHero(id: number): Hero {
        return this.dataRepo.queryHero("... where id=..."); // Pseudo-code
    }

    getHeroIds(query: string): number[] {
        let ids = this.dataRepo.queryHeroIds(query);
        // Check/validate whatever
        return ids;
    }

    getHeroes(query: string): Hero[] {
        let heroes: Hero[] = [];
        let ids = this.getHeroIds(query);
        for (let id of ids) {
            heroes.push(this.getHero(id)); // One query per hero
        }
        return heroes;
    }
}

Then let's assume we are writing some kind of MVVM app that uses this data service. (We don't want to get into a discussion about which data-fetching strategy is best for MVVM, so this is just a pseudo-code sample):

class HeroesListComponent {
    public heroes: Hero[] = [];
    private dataServiceClient: DataServiceClient;

    constructor(dataServiceClient: DataServiceClient) {
        this.dataServiceClient = dataServiceClient;
    }

    search(query: string) {
        // DataServiceClient calls DataServiceServer.getHeroes() via REST.
        this.heroes = this.dataServiceClient.getHeroes(query);
    }
}

By looking at this code, we can immediately identify the problem: DataServiceServer issues N+1 queries, N being the number of heroes found. Static analysis can't do much here, because the consequences depend on the dynamics of N. Unit tests in the development environment would not show any performance problem either, because everything runs locally, and even with a few hundred heroes it will not be noticeably slow.
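For contrast, a version that avoids the problem would fetch all matching heroes in a single round trip. A minimal sketch, assuming a hypothetical queryHeroes method on the repository:

class BatchedDataServiceServer {
    private dataRepo: DataRepo;

    getHeroes(query: string): Hero[] {
        // One query returning all matching heroes instead of one query per hero.
        return this.dataRepo.queryHeroes(query); // Pseudo-code, hypothetical repository method
    }
}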

Antipatterns

So load testing and static analysis alone are not good enough. We need something that recognizes problems that depend on runtime dynamics. The sample above shows the classic N+1 query antipattern. If we used a tracing tool, we could visualize the problem:

"N + 1"



First we load our hero ids (= first query) and then we load each hero. For 6 heroes, we need 7 queries in total.

Dynamic Architecture Validation

Let's now use AI to analyze our data access. As you may remember from our previous posts, we used custom events to instrument our client app.
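As a reminder, such a custom event might be sent roughly like this (a minimal sketch using the Application Insights JavaScript SDK; the setup, event name and property are just examples, not the exact code from the earlier posts):

import { ApplicationInsights } from "@microsoft/applicationinsights-web";

// Assumed one-time setup at application startup.
const appInsights = new ApplicationInsights({
    config: { instrumentationKey: "<your instrumentation key>" }
});
appInsights.loadAppInsights();

function trackSearch(query: string) {
    // Emit a custom event per search call, e.g. from HeroesListComponent.search().
    appInsights.trackEvent({ name: "SearchHeroes" }, { query: query });
}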

"N + 1 in AI"



It quickly reveals the antipattern: one service call causes many SQL queries. That's nice if we are trying to pin down a problem, but it is still reactive.

Use AI Analytics to proactively identify Antipatterns

If we want to identify antipatterns without clicking through every flow in AI (as above), AI Analytics comes into play. It is an analytics engine for querying all the telemetry data that has been collected by AI. It provides a powerful query language and useful charting capabilities.

One of the simplest queries is searching for custom events:

"AI query custom events"



Next we join the custom events with the dependencies and group them by timestamp, event name and operation ID:

customEvents
 | project timestamp, operation_Id, eventName = name, client_Type
 | where timestamp >= ago(60min) and eventName != "ServiceProfilerSample" and client_Type == "Browser" and notempty(operation_Id)
 | join kind=leftouter  (dependencies)  on operation_Id  
 | summarize nrDeps = count(operation_Id) by timestamp, eventName, operation_Id, type

Finally, we group everything by eventName in the UI:

"Analytics-group by event-name"



And we let Analytics create a chart for us:

"Analytics-group by event-name"



That's what we wanted: let some tests run in any environment (dev, staging, prod) and see at a glance whether there are any unusual patterns. Even during development, when we are testing with only a few hundred domain objects (heroes), we know that there will be a problem in production.

Last but not least, we put everything on our APM dashboard and that's it!

"Analytics-group by event-name"