KQL Series – quick intro to Azure Data Explorer

Azure Data Explorer is a PaaS offering from Azure providing an end-to-end solution for data exploration.

https://azure.microsoft.com/en-us/blog/individually-great-collectively-unmatched-announcing-updates-to-3-great-azure-data-services/

Here is a quick introduction of the features:

https://azure.microsoft.com/en-us/services/data-explorer/#features

This service from Azure was developed to provide end-to-end data exploration services to help the businesses get quick insights, and make critical business decisions. It can be used for streaming data as well to identify patterns, statistics, anomalies, outliers, and even to diagnose issues.

ADX is a fully managed data analytics service for near real-time analysis on large volumes of data streaming (i.e. log and telemetry data) from such sources as applications, websites, or IoT devices.  ADX makes it simple to ingest this data and enables you to perform complex ad-hoc queries on the data in seconds – ADX has speeds of up to 200MB/sec per node (up to 1,000 nodes) and queries across a billion records take less than a second (!!).  A typical use case is when you are generating terabytes of data from which you need to understand quickly what that data is telling you, as opposed to a traditional database that takes longer to get value out of the data because of the effort to collect the data and place it in the database before you can start to explore it. This allows us to iterate our queries really, really quickly.

Azure Data Explorer has a very fast indexing and a very powerful query language to work with the data. You guessed it – that language is Kusto Query Language!!.

Azure Data Explorer aka ADX, is a fast, highly scalable and fully managed data analytics service for log, telemetry and streaming data. This data exploration service enables you to pull together, store and analyze diverse data. You can query terabytes of data in a few seconds and it allows fast ad-hoc queries over the varied data.

ADX is a PaaS offering from Azure, which is capable of performing analysis on large volumes of data from heterogeneous sources, like – Custom Applications, IoT Devices, Diagnostic Logs, and other streaming data sources as well. This data can be structured, unstructured, or free text.

Data within ADX is organized in relational tables within the database, which has a strongly typed schema. It can scale quickly depending upon the ingested data volume and query load.

ADX works on the principle of isolation between Compute and Storage using volatile SSD storage as a cache and persistent storage in Azure Blob Storage. It is a fully managed ‘Platform as a Service (PaaS)’ that lets users focus only on their data and queries.

Azure Data Explorer can be used with other Azure offerings to create end-to-end solutions for various use cases.

Let’s have a look at some of those use cases for ADX.

#Yip.

KQL Series – KQL usage with Advanced Hunting (this is huge)

This blog post is about another use of KQL that will definitely help your organisation and make you very popular.
First of all check out this video here which can help you learn stuff about KQL:

https://www.microsoft.com/en-us/videoplayer/embed/RWRwfJ?postJsllMsg=true

I have been talking about KQL and using it with Log Analytics and also Application Insights.

We use Log analytics for all our infrastructure things and combine with Azure Monitor to “do” things with our captured data:

But just as equally important is Advanced Hunting – where we can use KQL and start looking at what security events are occurring within our Azure Subscription.

https://security.microsoft.com/v2/advanced-hunting

(NOTE: you do need an Azure subscription to run this)

Advanced hunting is based on the Kusto query language. We can use Kusto operators and statements to construct queries that locate information in a specialized schema.

Here is an example that we can run:

// Finds PowerShell execution events that could involve a download union DeviceProcessEvents, DeviceNetworkEvents
| where Timestamp > ago(7d)
// Pivoting on PowerShell processes
| where FileName in~ ("powershell.exe", "powershell_ise.exe")
// Suspicious commands
| where ProcessCommandLine has_any
("WebClient",
"DownloadFile",
"DownloadData",
"DownloadString",
"WebRequest",
"Shellcode",
"http",
"https")
| project Timestamp, DeviceName, InitiatingProcessFileName, InitiatingProcessCommandLine, FileName, ProcessCommandLine, RemoteIP, RemoteUrl, RemotePort, RemoteIPType
| top 100 by Timestamp

A short comment has been added to the beginning of the query to describe what it is for. This comment helps if you later decide to save the query and share it with others in your organization.

// Finds PowerShell execution events that could involve a download

The query itself will typically start with a table name followed by several elements that start with a pipe (|). In this example, we start by creating a union of two tables, DeviceProcessEvents and DeviceNetworkEvents, and add piped elements as needed.

union DeviceProcessEvents, DeviceNetworkEvents

Set the time range

The first piped element is a time filter scoped to the previous seven days. Limiting the time range helps ensure that queries perform well, return manageable results, and don’t time out.

| where Timestamp > ago(7d)

Check specific processes

The time range is immediately followed by a search for process file names representing the PowerShell application.

// Pivoting on PowerShell processes
| where FileName in~ ("powershell.exe", "powershell_ise.exe")

Search for specific command strings

Afterwards, the query looks for strings in command lines that are typically used to download files using PowerShell.

// Suspicious commands
| where ProcessCommandLine has_any("WebClient",
    "DownloadFile",
    "DownloadData",
    "DownloadString",
    "WebRequest",
    "Shellcode",
    "http",
    "https")

Customize result columns and length

Now that your query clearly identifies the data you want to locate, you can define what the results look like. project returns specific columns, and top limits the number of results. These operators help ensure the results are well-formatted and reasonably large and easy to process.

| project Timestamp, DeviceName, InitiatingProcessFileName, InitiatingProcessCommandLine, 
FileName, ProcessCommandLine, RemoteIP, RemoteUrl, RemotePort, RemoteIPType
| top 100 by Timestamp

Select Run query to see the results.

This truly an amazing usage of KQL in my opinion!!

You can read more about Advanced hunting  (Formerly known as Advanced Threat Protection) here:

https://docs.microsoft.com/en-us/microsoft-365/security/defender/advanced-hunting-query-language?view=o365-worldwide

KQL Series – SQL to KQL Cheat Sheet

This blog post is about how to quickly learn KQL.

Kusto supports a subset of the SQL language. See the list of SQL known issues for the full list of unsupported features.

The primary language to interact with the Kusto Engine is KQL (Kusto Query Language). To make the transition and learning experience easier, you can use Kusto to translate SQL queries to KQL. Send an SQL query to Kusto, prefixing it with the verb ‘EXPLAIN’.

So let’s write some SQL here:

EXPLAIN
SELECT COUNT_BIG(*) as C FROM StormEvents

and we get:

StormEvents
| summarize C=count()
| project C

Reproduced from here:

https://docs.microsoft.com/en-us/azure/data-explorer/kusto/query/sqlcheatsheet#sql-to-kusto-cheat-sheet-1

SQL to Kusto cheat sheet

The table below shows sample queries in SQL and their KQL equivalents.

CategorySQL QueryKusto Query
Select data from tableSELECT * FROM dependenciesdependencies
SELECT name, resultCode FROM dependenciesdependencies | project name, resultCode
SELECT TOP 100 * FROM dependenciesdependencies | take 100
Null evaluationSELECT * FROM dependencies
WHERE resultCode IS NOT NULL
dependencies
| where isnotnull(resultCode)
Comparison operators (date)SELECT * FROM dependencies
WHERE timestamp > getdate()-1
dependencies
| where timestamp > ago(1d)
SELECT * FROM dependencies
WHERE timestamp BETWEEN ... AND ...
dependencies
| where timestamp > datetime(2016-10-01)
  and timestamp <= datetime(2016-11-01)
Comparison operators (string)SELECT * FROM dependencies
WHERE type = "Azure blob"
dependencies
| where type == "Azure blob"
-- substring
SELECT * FROM dependencies
WHERE type like "%blob%"
// substring
dependencies
| where type contains "blob"
-- wildcard
SELECT * FROM dependencies
WHERE type like "Azure%"
// wildcard
dependencies
| where type startswith "Azure"
// or
dependencies
| where type matches regex "^Azure.*"
Comparison (boolean)SELECT * FROM dependencies
WHERE !(success)
dependencies
| where success == "False"
Grouping, AggregationSELECT name, AVG(duration) FROM dependencies
GROUP BY name
dependencies
| summarize avg(duration) by name
DistinctSELECT DISTINCT name, type FROM dependenciesdependencies
| summarize by name, type
SELECT name, COUNT(DISTINCT type)
FROM dependencies
GROUP BY name
dependencies
| summarize by name, type | summarize count() by name
// or approximate for large sets
dependencies
| summarize dcount(type) by name
Column aliases, ExtendingSELECT operationName as Name, AVG(duration) as AvgD FROM dependencies
GROUP BY name
dependencies
| summarize AvgD = avg(duration) by Name=operationName
SELECT conference, CONCAT(sessionid, ' ' , session_title) AS session FROM ConferenceSessionsConferenceSessions
| extend session=strcat(sessionid, " ", session_title)
| project conference, session
OrderingSELECT name, timestamp FROM dependencies
ORDER BY timestamp ASC
dependencies
| project name, timestamp
| order by timestamp asc nulls last
Top n by measureSELECT TOP 100 name, COUNT(*) as Count FROM dependencies
GROUP BY name
ORDER BY Count DESC
dependencies
| summarize Count = count() by name
| top 100 by Count desc
UnionSELECT * FROM dependencies
UNION
SELECT * FROM exceptions
union dependencies, exceptions
SELECT * FROM dependencies
WHERE timestamp > ...
UNION
SELECT * FROM exceptions
WHERE timestamp > ...
dependencies
| where timestamp > ago(1d)
| union
  (exceptions
  | where timestamp > ago(1d))
JoinSELECT * FROM dependencies
LEFT OUTER JOIN exception
ON dependencies.operation_Id = exceptions.operation_Id
dependencies
| join kind = leftouter
  (exceptions)
on $left.operation_Id == $right.operation_Id
Nested queriesSELECT * FROM dependencies
WHERE resultCode ==
(SELECT TOP 1 resultCode FROM dependencies
WHERE resultId = 7
ORDER BY timestamp DESC)
dependencies
| where resultCode == toscalar(
  dependencies
  | where resultId == 7
  | top 1 by timestamp desc
  | project resultCode)
HavingSELECT COUNT(*) FROM dependencies
GROUP BY name
HAVING COUNT(*) > 3
dependencies
| summarize Count = count() by name
| where Count > 3

The part I love the most is being able to use EXPLAIN to translate what I’ve written in for 20+ years into KQL.

Nice one Microsoft!!

#Yip.

KQL Series – creating KQL queries

Basically when we are writing KQL we are asking the following questions:

Does it exist?

Where does it exist?

Why does it exist?

What shall we do with the information?

We will come back to these as we go along – but we use these questions to construct our KQL query.

Here is a simple enough search query:

SecurityEvent
| where TimeGenerated > ago (3h)
| where EventID == 4782
| summarize count() by Account
| project Account, PasswordsCracked = count_

(Note you can run this query here: https://aka.ms/LADemo )

Steps to create this query:

  1. Identify the table we want to run our query against. So here we are running against the Security Event table. The SecurityEvent table contains security events collected from windows machines by Microsoft Defender for Cloud or Microsoft Sentinel.
  2. We make use of the pipe | character which is above the enter/return key (return if you are on a Mac) and we use shift to get it. It is sued to separate commands issued to the query engine. In the example above each command is on its own line.
    A KQL query can be all one single line. As a recommendation, I prefer each command on its own line. It is neater and more organized which makes it easier to troubleshoot when a query fails (as mine typically do from time to time!!)
  3. Just as we would with other SQL languages we want to filter things. If we just ran this – we would get all security events in the past 24 hours…

The Where operator is the best way to accomplish this. You can see here in the example that we’re filtering first by when the occurrence happened (TimeGenerated) and then by Windows Event ID (4782 – the password hash of an account was accessed.).

4. The next step in our workflow is to provide data aggregation. What do we want to do with this filtered data? In the example, we want to create a count of the Accounts (usernames) that produced event ID of 4782 in the time greater than 3 hours ago.

5. Lastly we want to present the data, we use Project for this. We tell the engine we only want to display 2 columns in our results Account and PasswordsCracked.

You’ll see that we go zero results – so let’s change our query and look for logoff events – EventID of 4634

SecurityEvent
| where TimeGenerated > ago (3h)
| where EventID == 4634
| summarize count() by Account
| project Account, LogOffEvents = count_

What do we get now?

Being a data professional we should order our data so we will tell the query engine how we want to order the results.
SecurityEvent
| where TimeGenerated > ago (3h)
| where EventID == 4634
| summarize count() by Account
| order by count_ desc
| project Account, LogOffEvents = count_


Using the Order operator above, we’re telling the query engine that when the results are displayed, I want it shown in order by the LogOffEvents column.


The ‘desc’ in the query in the Order Data step is what produces this ordering. If we wanted ascending order we’d use ‘asc’.

So what did we do here?

It searched our stored security events in the SecurityEvent table for all Accounts that had a successful login in the last 3 hours and we chose to display only the Account and number of log off events per Account in numerical order with the highest at the top.

So far I’ve introduced some new operators and things – but what is a really quick way to learn KQL?

Let’s look at the best cheat sheet ever…

#Yip.

KQL Series – understanding KQL queries (Part 2)

In Part1 we talked about what a query is.

The most common query we will write is a tabular expression statement which is what people usually have in mind when we talk about queries. This statement usually appears last in the statement list, and both its input and its output consists of tables or tabular data sets.

NOTE: Any two statements must be separated by a semicolon ;

We use a data flow model for the tabular expression statement. A tabular expression statement is generally composed of tabular data sources such as the tables we will query, tabular data operators such as filters and projections, and optional rendering operators.

The composition is represented by the pipe character (|), giving the statement a very regular form that visually represents the flow of tabular data from left to right. Each operator accepts a tabular data set “from the pipe”, and other inputs including more tabular data sets from the body of the operator, then emits a tabular data set to the next operator that follows.

So our query will look something like this:

Source | Operator1 | Operator2 | RenderInstruction

  • Source – tabular data sources such as Azure Data Explorer tables
  • Operator – tabular data operators such as filters and projections
  • RenderInstruction – rendering operators or instructions

Example

In the following slightly more complex example, the join operator is used to combine records from two input data sets, one that is a filter on the Logs table, and another that is a filter on the Events table.

Logs 
| where Timestamp > ago(1d) 
| join 
(
    Events 
    | where continent == 'Europe'
) on RequestId 

So for now let’s use tabluar expressions and look at how we construct information out of our data..

Or in other words the workflow in constructing our query..

#Yip.

KQL Series – understanding KQL Queries (Part 1)

In my pervious post I wrote about KQL queries that we will write – we also could use some free some samples that Microsoft put up for us to try things out:

https://dataexplorer.azure.com/clusters/help/databases/Samples

But what is a query? Kudos to Microsoft for putting some great definitions up on their site which I’ve used here.

NOTE:

At times you will see reference to Azure Data Explorer and Azure Monitor.
My first experience with KQL was with Azure Monitor – but I will talk to Azure Data Explorer later in this series. Back to queries…

A Kusto query is a read-only request to process data and return results. The request is stated in plain text, using a data-flow model that is easy to read, author, and automate. Kusto queries are made of one or more query statements.

A query consists of one or more query statements, delimited by a semicolon (;).

At least one of these query statements must be a tabular expression statement. The tabular expression statement generates one or more tabular results. Any two statements must be separated by a semicolon. When the query has more than one tabular expression statement, the query has a batch of tabular expression statements, and the tabular results generated by these statements are all returned by the query.

Two types of query statements:

  • Statements that are primarily used by users (user query statements),
  • Statements that have been designed to support scenarios in which mid-tier applications take user queries and send a modified version of them to Kusto (application query statements).

Some query statements are useful in both scenarios.

here are three kinds of user query statements:

All query statements are separated by a ; (semicolon), and only affect the query at hand.

To be honest most of the time I write tabular expressions – let’s have a closer look at this…

KQL Series – what is Kusto Query Language?

This blog post will detail what KQL is all about…

KQL was developed to take advantage of the power of the cloud through clustering and compute. Using this capability, KQL is designed as a well-performing tool to help surface critical data quickly. This a big part of why it works so well and outshines many other query languages like it. KQL was built for the cloud and to be used against large data sets.

Kusto Query Language is a powerful tool to explore your data and discover patterns, identify anomalies and outliers, create statistical modeling, and more. The query uses schema entities that are organized in a hierarchy similar to SQL’s: databases, tables, and columns.

The query we write in KQL is a read-only request to process data and return results. The request is stated in plain text, using a data-flow model that is easy to read, author, and automate. Kusto queries are made of one or more query statements.

Here is an example query:

StormEvents
| where StartTime between (datetime(2007-11-01) .. datetime(2007-12-01))
| where State == "FLORIDA"
| count

I will show you below how to run this code and also any code that I put in these blog posts – for FREE!!

Where did I first discover Kusto Query Language?
In Azure using Log Analytics – as I was collecting diagnostic data and needed to analyse it:

Where can you discover and try out Kusto Query Language?
Well Microsoft have some free demo sites you can use:

•Log Analytics  https://aka.ms/LADemo

•Application Insights  https://aka.ms/AIAnalyticsDemo

But it gets better….
Go here to run the code above and get the result of “28”

https://dataexplorer.azure.com/clusters/help/databases/Samples

You don’t need to pay anything just have an account registered with Microsoft.

It’s that easy…

#Yip.

KQL Series – why is it called Kusto Query Language?

I use KQL on an hourly basis….

But for a query language – why call it Kusto..?

Where is a funny tidbit of information:

You have probably heard something sounds like Kusto before…?
I know I had – when I was a young kid.

https://www.imdb.com/title/tt0192937/?WT.mc_id=m365-0000-rotrent

KQL is named after Jacques Cousteau. Even today, you can find evidence of this in the Azure Monitor Docs. If you go to the datatable operator page right now, you’ll still find a reference to him in an example that lists his date of birth, the date he entered the naval academy, when he published his first book entitled “The Silent World: A Story of Undersea Discovery and Adventure,” and the date when he passed.

So the puns about deep diving into data, exploring the depths of data to find and present information – basically we are going to be explorers in a sea of data…

#Yip.

Kusto Query Language – a series

This blog post is about a new query language that I have learnt and I really think you need to learn it too. Especially if you are doing ANYTHING in Azure.

Basically about 3.5 years ago I was building some things in Azure and I realised that I need to get diagnostics and metric information – from all the data that I was capturing.I found that Log Analytics had everything I needed from an infrastructure side…

Then I worked with some awesome frontend developer who were using KQL and Application Insights to get telemetry information about their web services hosted in Azure.

I started learning KQL and realised that I needed to start sharing the brilliance of it.

I have felt so strongly about this that I’ve done a number of talks and even videos on it.

The idea behind it is pretty much the same:

Azure is the leading platform for many companies, and to manage your databases and other infrastructure you need to have insights into what is happening.
You need to be able to query the Azure platform, to not only understand your infrastructure but also to leverage monitoring and analytics to react to operational changes occurring in it.
This session will introduce you to the Kusto Query Language (KQL) which will allow you to query a variety of Azure resources.

At User Groups:
https://www.meetup.com/PowerBI-SQLSERVER-DataManagement-Christchurch-NZ/events/284706814/

https://www.meetup.com/Southampton-Data-Platform-and-Cloud-Group/events/282380773/

Videos: (excuse the hair….)

https://techcommunity.microsoft.com/t5/azure-sql-blog/kql-the-next-query-language-you-need-to-learn-data-exposed-mvp/ba-p/3042982

I knew I was onto a good thing when my close mate Steve Jones blogged about it here:

So I have decided to start a series of blog posts here that talk about the magic-ness of KQL.

Here we go……

#Yip.

Azure Platform Series – renaming your YAML pipeline in Azure DevOps

This blog post is about YAML pipelines in Azure DevOps.

I had a repo called InfrastructureAsCode for a client. I have been transitioning them to use YAML for their build pipelines in Azure DevOps – because now we can source control the build itself.

If you do not know what a YAML pipeline is in Azure DevOps check it out here:

https://docs.microsoft.com/en-us/azure/devops/pipelines/yaml-schema

You can try this for yourself by using the awesome AzureDevOpsLabs:

https://azuredevopslabs.com/labs/azuredevops/yaml/

I had created a pipeline using YAML – which was called InfrastructureAsCode as the YAMP file was in the root directory.

However I wanted to move it into a folder .\InfrastructureAsCode\pipelines\… and run the YAML file from there – as I would have a non-prod and PROD version of them (as the schedule was different for each).

That went well – but I didn’t like the name InfrastructureAsCode as my pipeline name. And it got worse when I added the PROD version of the YAML file – I now had InfrastuctureAsCode(1).
That’s even uglier.

So I needed to rename the pipeline – but how do you do this with YAML??

Well – click your pipeline and edit it and then up in the right hand corner is the 3 dots – click that and choose Triggers:

Click on YAML tab:

And now you have renamed your pipeline!!

Yip.