A place for my thoughts and experiences with the Microsoft Data Platform
The data ecosystem explained in 10 simple images
A couple of weeks ago I was explaining how IoT and Big Data can be a game changer in sports and human activities. Yeah, it's cool to put a sensor in your grill so that it can automatically tell you the exact time you should turn your filet, but I think it's even cooler to instrument the human body and objectively quantify human activities, especially in sports and health care, so that we can enjoy even more the sports we love and the life we live. After all, knowledge is power, right? And the first step to knowledge is the ability to measure: IoT is really a big deal here.
But then? From a business perspective, how and where will all the gathered data be stored? What do terms like Data Lake, Data Science and the like really mean? Surprisingly enough, the answer is as simple as describing the life cycle of something that everyone knows very well: orange juice.
I created a slide deck, named Data Juice, to explain the whole data ecosystem to C-level guys who had to understand the business value of every piece of that ecosystem, but without having to deal with technical details. It had to be clear, concise and easy to remember. The juice analogy was just perfect, as you'll see.
Now, the same concept applies to IoT: the sensor working right now in the wearable you're wearing almost surely collects raw data (acceleration, magnetic field, etc.) at something around 50 times per second (sometimes up to 100). But what you care about is not that raw value, but the trend of your performance in your preferred exercise.
What's the line that connects those two points? Happily enough for me, answering this question also allows me to describe what kind of work I do and how it is related to sports, which was one of the questions I was asked recently by a well-known tech & lifestyle magazine.
Here’s the answer:
New data is born every day.
New data is generated every day, everywhere. Accounting systems, IoT devices, machine logs, clickstreams… anything generates some kind of data. At this stage data is usually generated and consumed locally and in very small sizes. For example, an accounting system generates an invoice and allows users to interact with it. One at a time. Just like oranges that are born on the tree: you can just grab one and consume it in place.
Oranges are harvested and sent to the factory for processing. The factory needs to be prepared to get oranges of any size, kind, color and ripeness. Sometimes it will get only one load a day, sometimes several per hour. What happens with data is really similar. The famous three Vs (Volume, Variety and Velocity) say that your data system must be able to accept almost any kind of data. Again, just like with oranges: if you want to be in the business, you'd better be prepared to handle them when they arrive.
All the ingested data needs to be stored somewhere. And it had better be a place where space can grow as needed, and it should also be pretty cheap, if you don't want to spend all your money storing everything even when you don't know if and how you will use that data in the future. Such a place also needs to support querying and processing capabilities, so you can dive into that lake. You don't really care too much about shape, format or correctness here. You just store data as it was generated, to have it in case you need it. Even if it may not be good for consumption.
Also known as ETL (Extract-Transform-Load), this is the process where data is selected, cleansed, standardized and integrated with additional information, if available. More or less how oranges are cleaned, divided into different groups for different usages, and, if not in good condition, discarded.
Once data has been processed by the previous step, it needs to be stored in a safe place. Safe not only in the sense that no one can steal from it, but also in the sense that everyone who gets data from it can assume it is safe to consume. Again, just like oranges: if you get one from a market warehouse, you can safely enough assume that you won't be poisoned. The same goes for data. If you get data from the Data Warehouse, you can safely assume that it is safe to consume, meaning you can use it to make informed decisions. The same does not apply to a Data Lake, where you usually don't really know what you'll get or how to deal with it.
Once you get your oranges from the store, you can decide how to consume them. If you have hundreds of boxes you won't consume them one by one: you'll surely transform them into an easier-to-consume product. Depending on the target, you may want to process them in different ways, to make sure that everyone gets exactly what they need, no more and no less. A Data Mart is exactly this: data ready to be consumed with very little additional processing, if any at all.
Of course, different people or different situations may require different aspects of that data, so having more than one Data Mart is quite common. If something is not clear enough in the Data Mart, one can always go into the Data Warehouse that lies behind it and check how and why such unexpected data was produced.
Also known as Self-Service BI, this is what you do when you're not happy with the Data Marts: since none of them provides exactly what you need, you have to do it yourself. You need to squeeze the juice of three different oranges, of three different kinds, to create the perfect mixture you want. You go into the Data Warehouse (usually, but sometimes you can also grab data from different Data Marts) and create your mix. Maybe adding a hint of cinnamon taken from the outside world.
Using all the things described so far, you apply the Business Intelligence process to figure out how to drive your business, check how it is performing and analyze the outcome of defined strategies to stay in business and, possibly, be successful.
What to do with the vast amount of unused data in the Data Lake? Maybe among the oranges that don't fit the current sales plan and target there is the next "big thing". Something you can leverage to create a new product, sell a new service, or something that can help you understand whether anything in the collect-store-process-enrich life cycle can be improved. Usually this means that you are looking for questions, not for answers. This is the main difference between the process of doing Business Intelligence and the process of Data Science: in the first you look for answers to questions you know, in the second you look for questions you don't know yet.
In conclusion, this is what information technology, from a data standpoint, is all about: squeezing every drop of information out of all the data we have, so that we can gain knowledge that ultimately leads to wisdom.
So now, when someone asks me what kind of work I do, I say that "I squeeze information out of raw data". Easy, understandable and, actually, really precise.
I've been involved with PASS, once named the Professional Association of SQL Server users, now just... PASS, since 2006, when I stepped up as president of the SQL Server Italian User Group. Now that I've moved to Redmond, WA, I'm still involved with it, but of course I'm not leading the Italian User Group anymore.
Since I really do think that community activity is vital for a healthy and successful career, as soon as I noticed that the Application Development Virtual Group was not very active anymore I decided to offer my help. And so here I am: I've just stepped up as the new leader of that group, AppDev for friends:
The focus will be on the entire Microsoft Data Platform, not only SQL Server, and the languages we'll try to cover are, of course, C# and .NET in general and, more broadly, any language that can interact with the Microsoft Data Platform. So Python (my passion!), Java, Ruby, F#, Rust, R... anything you can think of. Of course, everything approached from a developer perspective. No administration here, no sir. Security? Yes. Performance? Sure, give me more! Best practices? Absolutely! Everything that can be of interest to developers and architects will be in our scope (provided it is somehow related to the Microsoft Data Platform, of course).
I've just started contacting some great speakers so we can start off on the right foot, and the first has already confirmed his support. Jovan Popovic, SQL Server / SQL Azure PM, will present a session on JSON, one of the most requested and powerful features in SQL Server 2016 and beyond:
Building REST API with SQL Server using JSON functions
Registration is free for both the seminar and the Virtual User Group. You can find the registration link and additional info on the UG homepage:
I recommend registering for both, so you won't miss a very interesting webcast (which will be recorded and made available on YouTube for future reference) and you'll also stay informed about future activities, which will be, I can assure you, just amazing!
And, last but not least, if you have a topic you want to present, just let me know.
I want to be clear about the target and the scope of the “For the Better Developer” series.
The articles don't aim to be 100% precise. The aim is to give you, the developer who understands how important the ability to manipulate data is in today's work, a good knowledge of how to use some features, not of how they really work behind the scenes. I hope you will be interested in learning how things work, and the section "for those who want to know more" that you can find at the end is there with that specific purpose in mind. But again, it doesn't really make sense to dive deeply into how things work, since there are a lot of extremely good books, articles, documentation and white papers that already do this. My objective is to give you enough information to:
- help you in your work
- spark in you the will to know more
so sometimes I have to simplify things in order to make them consumable:
- by everyone, especially those who don't have a deep database background yet
- in just a bunch of minutes, since you're probably reading this while at work, and you surely have some work to deliver by a deadline that is always too near
The articles in the series assume that you already have basic practical knowledge of a database. For example, you know how to create a table and how to query it.
The section "for those who know more" is created for those who already know something deeper about databases and found the simplifications in the article too… simple. Here I explain and clarify the topics I had to simplify a lot, sometimes even telling a lighter version of the truth, in order to make them easily accessible to everyone. But the truth needs to be told, and this section is where I tell it.
After 15 years in Database Performance Tuning, Database Optimization, Data Modeling, Business Intelligence, Data Warehousing, Data Analytics, Data Science, Data Whatever, I'm back in the development space, where I also lived for the first 10 years of my career.
Houston, we have a problem
After 15 years I still see the same big problem: the huge majority of developers don't really know how to properly deal with data. It's not a criticism or a rant. It's just a matter of fact: especially several years ago, if you spoke to a developer, it was clear that dealing with data, and especially with a database, was not something he felt was part of his job.
This is not true anymore: any developer today knows that a bit of database knowledge is needed. No, better: it's mandatory. Still, that way of thinking that was common years ago created a huge hole in developers' knowledge that is showing its bad effects now. If you didn't learn how to properly deal with a database, it's now time to fill that gap. Unfortunately, 15 years is a lot of time and databases have evolved immensely in that time frame, especially relational databases, so trying to recover the lost time may be really hard. Still, it's mandatory: for you, your team and the company you own or work for.
I really do believe in community and sharing, so I decided to try to help fix this problem. If you want to read the full story, go on; otherwise you can just skip to the last paragraph. As always, it's your choice.
The market changed, asking for more integrated and interdisciplinary knowledge in order to react more quickly to changes. Developers needed a way out: databases were just too old and boring, and they really didn't want to learn SQL, that freakin' old language.
For a while, the NoSQL movement seemed to come to the rescue, promising developers something that would take care of data for them. Auto-indexing, auto-sharding, auto-scaling, auto-replication. JSON queries. JSON data. JSON everything. No schema. Just freedom, beers and peace for everyone.
The world today
Of course that was (obviously) too good to be true. Even NoSQL databases, now that they have matured, need to have data properly modeled and indexes correctly defined, used and maintained, not to mention security, transactions, and techniques to optimize performance.
There is a big difference between now and 15 years ago: now we live in a world where data is the center of everything, so developers need to know how to handle it. Look at any job offer for developers and you will find that some degree of database knowledge is always required.
You probably already know how to query several databases, be they SQL or NoSQL: you had to learn it to survive.
Become a Better Developer
Frankly speaking, surviving is just not that nice. Evolving, thriving, accomplishing something bigger is nice. And to do that, you really need to be able to make a difference. To stand out from the crowd, there is only one thing you have to do: become good, extremely good, at dealing with data.
You don't have to become a DBA or a Data Scientist if you don't want to. But dealing with data from a position of control, instead of just trying to avoid dying under its weight, is something any good developer should learn.
Do you want to use MongoDB? Go and become the champion of MongoDB. You are excited about the new Google Spanner? Go and learn it like there is no tomorrow. Choose one or more databases of your choice and master them. It will make a whole lot of difference, since you will have the power to turn data into what you want, without having to reinvent the wheel every time.
More and more computing power is moving into the database. It is just much easier to move compute logic into the database than to move the huge amounts of data we deal with every day out to the application. This is already happening and will continue for the foreseeable future: knowledge of a database is now more important than ever. Performance, and thus costs, depend on how good you are at manipulating data.
The cloud makes all these reasons even more important, since the relationship between performance and costs is stronger than when you are on-premises.
A good data model, the correct index strategy and cleverly written code can improve your performance by orders of magnitude. And reduce costs at the same time.
I have chosen SQL Server for myself. I think that nowadays it is the most complete platform: it supports all the workloads one can ask for and offers NoSQL, in-memory and columnar capabilities all in one product. All with exceptional performance and a great TCO. Plus, it lives in the cloud.
Just to make it clear: I'm not a fanboy. I've also studied and used MongoDB, CouchDB, Memcached, Hadoop, Redis, Event Store, SQLite, MySQL, Elasticsearch, PostgreSQL. But since the Microsoft platform is where I work most of the time, it makes sense for me to use SQL Server.
To be 100% honest and transparent: I'm also a Microsoft Data Platform MVP. But that's not why I like SQL Server and SQL Azure so much. It is because I wanted to be a developer who can make a difference that I learned SQL Server so well, which in turn drove me to be recognized as an MVP. Since I'm a data lover, I always take a deep look at what the market has to offer: I always go for an objective and rational approach.
“For the Better Developer” series
I really love development, and I love it even more when I am among talented developers who love their job and know the technologies they work with extremely well, because when that happens, great things happen as a result. When that happens my work becomes pure joy. It becomes poetry. I've been lucky enough to experience such situations several times in the past. And I want more.
So I want to increase my chances of being in that situation, and that's why I'm starting a series of articles that will try to help everyone who wants to become a better developer with SQL Server and SQL Azure.
If you're a developer who needs to use SQL Server or SQL Azure, in the articles of this series you will find a quick way to learn the basics, and also where to look for deeper knowledge.
Here I’ll keep the updated list of the articles I have written so far:
SQL Server Indexes
Indexes are a key factor for a database, relational or not. Without indexes the only option for a database would be to read all the data and discard what is not needed. Very inefficient.
If a query has a WHERE, JOIN or GROUP BY clause and you feel performance is not good, keep in mind that you will surely benefit from using an index, before trying to figure out how to cache results to improve overall application performance (caching will surely help, but it puts on you the burden of maintaining the cache, making the solution more expensive and complex to develop, maintain and evolve). How much is hard to say, but I would expect performance improvements in the order of 10x to 1000x. Yes: an index can make such a difference.
More on caching and other stuff is discussed at the bottom of the article, in the “For those who want to know more” and “For those who know more” sections.
Wondering who I am and why I’m writing these posts? Read the introductory article here. Also be sure to read the notes about the “For the Better Developer” article series, so that you can know what to expect.
The indexes you have to know
There are two principal types of indexes in SQL Server: clustered and non-clustered. The main difference between the two is that a clustered index works directly on table data, while a non-clustered index works on a copy of that data.
Both index types support two different storage models: row-store and column-store.
There are therefore four possible combinations to choose from when we need to add an index to our tables.
Clustered Row-Store Indexes
An easy way to explain this index is the encyclopedia or the phone directory: data is stored according to the order specified by the index. More specifically, by the columns declared when the index is created.
create clustered index IXC on dbo.Users(Surname, Name)
In the example above, rows will be stored ordered by Surname and then by Name. Just like in a phone directory, duplicates are allowed.
Each row will also be stored along with all the remaining column values. If the table dbo.Users has the following definition:
create table dbo.Users
(
    Id int not null,
    Surname nvarchar(50) not null,
    Name nvarchar(50) not null,
    DateOfBirth date not null,
    PhoneNumber varchar(50) not null,
    Address nvarchar(100) not null,
    City nvarchar(50) not null,
    State nvarchar(50) null,
    Country nvarchar(50) not null
)
the index row will contain not only the indexed columns but also the Id, DateOfBirth, PhoneNumber, Address, City, State and Country. This happens because, remember, a clustered index works directly on table data.
Of course, since a clustered row-store index works directly on table’s data, there can be only one clustered index per table.
When should you use this type of index? It is perfect if:
- you are looking for zero, one or more rows in a table
- you are looking for a range of rows: from “A” to “B” for example.
- you need to have the rows returned in a specific order
As you can see, it fits all usage scenarios and, in fact, the clustered row-store index is very flexible and can be useful in almost every scenario. But unfortunately, as already said, only one can exist per table. This limitation comes from the fact that such an index physically orders data in the table and, of course, you can order data only one way at a time.
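For instance, with the IXC index defined above, a range search like the following can be answered by reading just the relevant slice of the table, with rows coming back already sorted (a sketch on the sample table):

```sql
-- Range scan on the clustered index: only the surnames from A to B
-- are read, and they come back already ordered
select Surname, Name, PhoneNumber
from dbo.Users
where Surname >= 'A' and Surname < 'C'
order by Surname, Name;
```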
If you need to order it in more than one way at the same time, you need to duplicate some or all of the data, and that's why Non-Clustered Row-Store Indexes exist.
Non-Clustered Row-Store Index
A Non-Clustered Row-Store index is somewhat like the index you can find at the beginning of a magazine, where the topics are listed along with a pointer to the page where they can be found.
In the case of SQL Server, when you create a non-clustered row-store index on a column (following the phone directory example, let's say the PhoneNumber column), all values in that column will be indexed, and each one will have a pointer that allows SQL Server to relate that value to the table row that contains it.
create nonclustered index IXNC on dbo.Users(PhoneNumber)
Again, please note that values in the chosen columns need not be unique.
The non-clustered index requires additional space to be stored, since it keeps a copy of all the data in the columns you specified in the index definition, plus the space needed for the pointers.
When the index is used to locate some data, the result of that search is a pointer. This means that in order to get the real data, a lookup operation needs to be done. For example, given a PhoneNumber, you want to know who will answer if you call it. SQL Server will use the index on PhoneNumber and, after having found the number you are looking for, will take the pointer and use it to find all the related values, like Name, Surname, Address and so on. A lookup is nothing more than a jump to the location indicated by the pointer in order to access the real data.
Of course this additional operation has some overhead, directly proportional to the number of rows for which the lookup is needed. Are you looking for a thousand rows? A thousand lookups need to be done. The consequence of this behavior is that it doesn't always make sense to use an index, even if one exists. If the effort to do all the lookups (which can be expensive) is too high compared to reading the full table and just discarding the rows that don't fall within the scope of the requested search, the index will not be used.
As a general rule, then, it can be said that this index is perfect when you are looking
- for zero or one row in a table
- for *very few* rows in a table
As a rule of thumb, "very few" means 1% or less of the rows in your table.
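For instance, assuming the IXNC index defined above, a search by phone number seeks the index and then performs one lookup per matching row to fetch the remaining columns (a sketch on the sample table; the phone number is illustrative):

```sql
-- Seeks IXNC on PhoneNumber, then one lookup per matching row
-- to fetch Surname, Name and City from the table data
select Surname, Name, City
from dbo.Users
where PhoneNumber = '555-0123';
```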
Clustered Column-Store Index
This index effectively turns SQL Server into a columnar database. Really. Once created on a table, since it is a clustered index, it changes the way SQL Server stores that table: instead of saving data one row at a time, data is saved one column at a time, following the principles of columnar databases. For that reason you don't really need to specify any columns when creating such an index, since it applies to the whole table and thus all columns will be indexed:
create clustered columnstore index IXCCS on dbo.TrainingSessions
Column-store indexes are perfect for analytical queries. All queries in which data needs to be grouped, optionally filtered, and aggregated fall into this scope. Performance can easily increase by a factor of 10x, and it's quite common to see improvements of 30x or more. This is also a result of the fact that, to work properly, columnar solutions need to compress data, and SQL Server is no exception: the clustered column-store index compresses the data in your table using several techniques, and you may see your table's space shrink a lot, up to 60%-70% or even more.
Column-store indexes are especially good in read-intensive scenarios, while they do not excel at being written to very often. There is an exception for bulk-load scenarios: a bulk load is activated when you use the T-SQL command BULK INSERT, when you load data into SQL Server using the SqlBulkCopy object, or when you use any solution that supports the bulk-load option, like SQL Server Integration Services.
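As a sketch, a bulk load into a column-store table could look like this (file path and format options are illustrative; batches of roughly 102,400 rows or more can be compressed on the fly into columnar rowgroups):

```sql
-- Bulk load directly into the column-store table;
-- a large BATCHSIZE lets rows land in compressed rowgroups
bulk insert dbo.TrainingSessions
from 'C:\data\sessions.csv'
with (fieldterminator = ',', rowterminator = '\n', batchsize = 102400);
```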
If you have a table in which the majority of the queries filter, group and aggregate data, and write operations are a small number compared to read operations, then give this index a try. You will probably be amazed by how quickly data can be aggregated and manipulated, even on huge tables (huge: hundreds of millions of rows or more).
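Such an analytical pattern, sketched on a hypothetical training-session table, typically looks like this:

```sql
-- Filter, group, aggregate: the sweet spot of a column-store index
select SessionDate, sum(Calories) as TotalCalories, count(*) as Sessions
from dbo.TrainingSessions
where SessionDate >= '20170101'
group by SessionDate;
```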
The downside of this index is that, while it is great for working with aggregations, it is not that good at retrieving one or a few rows with all their columns. So you may want to mix row-store and column-store indexes together, and that's why the next, and last, index on the list may be very useful too.
Non-Clustered Column-Store Index
This index allows you to create a column-store index on only a subset of the columns of your table.
create nonclustered columnstore index IXNCCS on dbo.Users(City, State, Country)
This is helpful when you need your table to support a row-store index, which is quick for retrieving whole rows, while also being able to quickly aggregate over some columns of your data.
Using our phone directory example again, the above index will make any aggregation on City, State, Country or a combination of them really, really fast.
Let's make a more real-life scenario. Say you have a database that stores training sessions, and you want to present your users a nice dashboard with aggregated information for the last six months on the most useful indicators, like calories burnt, number of trainings done and so on. You also need to be able to quickly retrieve a single row, or a bunch of rows, to show all the data you have for a specific training session. The non-clustered column-store index, especially when used together with a clustered row-store index, is just perfect for this job.
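A minimal sketch of that mixed setup, with table and column names invented for the example:

```sql
-- Hypothetical table holding one row per training session
create table dbo.Workouts
(
    Id int not null,
    UserId int not null,
    SessionDate date not null,
    Calories int not null,
    DurationMinutes int not null
);

-- Clustered row-store index: fast retrieval of a whole session by Id
create clustered index IXC on dbo.Workouts(Id);

-- Non-clustered column-store index: fast aggregations for the dashboard
create nonclustered columnstore index IXNCCS
    on dbo.Workouts(UserId, SessionDate, Calories, DurationMinutes);
```

With this in place, point queries by Id use the clustered index, while the dashboard's group-and-aggregate queries are served by the column-store copy of the selected columns.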
Creating, altering or dropping an index may have a huge impact on the server. The bigger the table, the bigger the impact. Depending on the size of your data, such an operation can take seconds or hours, and in the meantime all users trying to use the database will be affected in one way or another. So before playing with indexes, run the script on a test server to evaluate the impact.
You now have the basic foundation to understand how to make applications go faster. Much, much, much faster.
Take the sample above, remove trainings and users and put in your own objects, like invoices, bills, orders, whatever you want, and you can easily understand how SQL Server can be perfect for managing a real-time analytical workload and, at the same time, a transactional workload, which is quite a typical requirement.
This is a really unique feature in the DBMS space today.
My opinion is that if you have less than 1 TB of "hot" data (data updated quite frequently during the day), you don't really need to look for a Big Data solution, since the overhead it would put on you and your team can be so big that I can hardly justify it for 1 TB of data. If you have 1 PB, or nearly that, then let's talk.
Indexes are really a huge topic; here we have just scratched the surface, and only a bit. I'll surely talk more about indexes in future posts, but if you don't want to wait and you're interested in the topic, which is my main objective with this post series, you can go to these external references:
For those who want to know more
SQL Server has its own internal cache. Actually, it has several different caches. The most important is the buffer cache: this is where data lives. In a healthy situation, 99% of your data comes from the cache. Yeah, right: 99%.
Row-store indexes are implemented as B+Trees. The technical difference between clustered and non-clustered is that in a clustered index the leaf pages hold the actual data, while in a non-clustered index they hold only the data for the columns used or included in the index, plus the pointer.
Clustered Row-Store Index
SQL Server doesn't really enforce the physical order of the data after the index has been created. Data is ordered following the index definition during the index creation phase, but after that, if there are operations that insert, delete or update data, the physical order is not preserved. What is preserved is the logical order of the data. This means that data will be read in the correct sequence, which may be different from the sequence in which rows are found on disk. This logical/physical difference is index fragmentation, which is not necessarily a bad thing. It usually is, but not in 100% of cases. For example, if you always access your data one row at a time, you care little about fragmentation. Or if you want to insert data as fast as possible, you don't really care about fragmentation either; on the contrary, you may want fragmented data in order to reduce the chance of hot spots where contention happens and performance suffers. Anyway, as usual, edge cases like these two must not be taken as the general rule.
Non-Clustered Row-Store Index
You can have more than one non-clustered row-store index on a table, and of course you can create such an index on more than one column. This enables some nice "tricks", like the covering index, that I will describe in future articles.
There are many more index types than the ones I described. They are specific to a certain data type, like geospatial indexes, or to a specific feature that I will discuss in another post, like Hekaton, the in-memory, lock-free engine. The indexes I described in this article are the most commonly used.
For those who know more
Caching is usually a nice thing to have, and I use caching solutions quite a lot myself. I just don't want to use them as a surrogate for the performance I can get right from my database. Caching is perfect, in my opinion, in highly concurrent systems to reduce the stress on your database, so that you can deliver the same user experience with a less expensive solution (be it smaller hardware on-premises or a lower performance level in the cloud).
Keep in mind also that SQL Server offers an extremely good option for creating customized caching solutions using in-memory tables, as Bwin, the online betting giant, demonstrated. So if you're already using a SQL Server version that supports this feature, it may be worth giving it a try. Otherwise Redis and the likes are just perfect.
The "Users" table
Yes, the table used in the sample has several design problems (both logical and physical). For example:
- there is a functional dependency between City, State and Country
- there is no Primary Key defined
- data types are really not the best possible
but I just wanted to show something as close to reality as possible, so it would be easier to relate the example to your own experience. In addition, I'll talk about keys and indexes in other dedicated articles, so here I went for something plain, simple and easy to understand.
Have you really come this far?
Well, thanks for reading. :)
I’m happy to share that I’ll be speaking at the next SQL Server Seattle User Group meeting on April 13th. The topic is again Temporal Tables, a subject I love a lot and that is 100% useful in all possible roles: dev, DBA or BI guy.
Temporal Tables with SQL Server 2016, vNext and Azure SQL
Keeping track of how data changed over time in a table has always been a difficult task. Data insertion and modification, and even querying, are just more complex when you want the result that was returned at a specific point in time in the past; even more so when you’re looking not for a specific point in time, but for a period of time. Temporal database theory and temporal operators surely can come to the rescue, but they are not a matter for the faint of heart! Luckily, one of the biggest (and most requested) features added to SQL Server 2016 solves exactly this problem, allowing the creation of change audit trails, data history and point-in-time queries in such a simple way that anyone, even on *current* applications, can benefit from it, simplifying solution architecture and saving time (and money) on maintenance and reporting.
In this session we’ll see how the feature works on SQL Server 2016 and Azure SQL v12, and also what will be available in the vNext version of SQL Server.
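To give you a taste of what will be shown, this is the kind of code the feature enables (table and column names are made up for the example):

```sql
-- A system-versioned (temporal) table: history is kept automatically
CREATE TABLE dbo.Product
(
    ProductId INT NOT NULL PRIMARY KEY CLUSTERED,
    Price DECIMAL(10, 2) NOT NULL,
    ValidFrom DATETIME2 GENERATED ALWAYS AS ROW START NOT NULL,
    ValidTo   DATETIME2 GENERATED ALWAYS AS ROW END NOT NULL,
    PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
)
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.ProductHistory));

-- Point-in-time query: the table as it was on January 1st
SELECT *
FROM dbo.Product
FOR SYSTEM_TIME AS OF '2017-01-01T00:00:00';
```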
Additional details, location and free registration are here:
On April 15, 2017 I’ll be speaking at SQLSaturday in Redmond, WA. I’ve settled down after the recent changes, so it’s time to start being an active part of the Microsoft Data Platform (formerly SQL Server) community again.
I’ll be talking about monitoring SQL Server Integration Services and the amazing support for JSON in SQL Server 2016:
SSIS Monitoring Deep Dive
In this session we’ll see everything interesting that is hidden in the SSISDB database, where you can gain a lot of insight into the outcome, the performance and the status of your SSIS packages. I’ll share everything I’ve learned building the SSIS Dashboard we’re currently using in production and that you can test here: http://ssis-dashboard.azurewebsites.net/. We’ll see the internals of the SSISDB database, how we can add custom logging information, and how we can use all this data to know exactly what happened at a specific point in time.
JSON & SQL Server: using it effectively
JSON is a de-facto standard for machine-to-machine communication, but also for storing “liquid” data. Once a killer feature of NoSQL databases, with SQL Server 2016 it is now available in a (post-)relational environment as well.
But how does it fit in the relational database space? How can JSON simplify DBAs’ and devs’ everyday work? What about performance? And what about flexibility?
In this session we’ll see how powerful JSON can be, when used correctly and not abused, providing solutions to common problems (like the well-known “dynamic schema” case) and simplifying development (for example, passing complex objects or a variable number of parameters to a stored procedure), with very good performance as well, even if it may be hidden in not-so-obvious places.
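As a quick preview of the session’s building blocks, here is roughly what JSON support in SQL Server 2016 looks like (the document is obviously just a made-up example):

```sql
-- Extract scalar values from a JSON document
DECLARE @json NVARCHAR(MAX) = N'{"name": "Davide", "interests": ["sql", "iot"]}';

SELECT JSON_VALUE(@json, '$.name') AS [name];

-- Turn a JSON array into a relational rowset
SELECT [value] AS interest
FROM OPENJSON(@json, '$.interests');
```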
The event is going to be an amazing one, as you can see from the agenda:
SQLSaturday #613 - Redmond 2017
So, if you live near there, don’t miss it.
See you there in a bit more than a month!
During the weekend I’ve updated my SQL Server interpreter for the amazing Apache Zeppelin to be compatible with the latest version, 0.7, which was released just a couple of days ago:
Apache Zeppelin Release 0.7.0
This new release has a lot of nice features (customizable charts, better multi-user support, improved tables and many others), and in order to start using the new features with SQL Server and Azure SQL too, the first step was to sync the existing interpreter with version 0.7. Well, that’s done.
You can download the source code and build it yourself from here:
GitHub - yorek/zeppelin: Apache Zeppelin with support for SQL Server
or get the Docker image with everything already built, for maximum comfort:
If you’re new to Apache Zeppelin, you can take a look at the articles I’ve already written to help people to get started:
In order to take advantage of the automatic interpreter discovery and native syntax highlighting introduced in version 0.7, I had to rename the interpreter group and the interpreter name from tsql to sqlserver. This means that any binding to the tsql interpreter will be invalid, which, in turn, means that all notebooks using it won’t work. Luckily, fixing this situation is easy: just create a new interpreter binding, using sqlserver as the interpreter group, and give it the same name you used before. Make sure the affected notebooks use this new binding (as the default one, if in your notebooks you didn’t use the %interpreter notation) and you’re done.
Updating Docker Container to version 0.7
If you were already running a Docker container with Apache Zeppelin for SQL Server, you may have notebooks and interpreter configurations that you want to keep.
The process to update everything without losing any existing work is the following (just be sure also to read the “Breaking Changes” section above!):
Get the updated image
docker pull yorek/zeppelin-sqlserver:v0.7
Once it’s downloaded, check if you have any running containers
docker ps
If you see an Apache Zeppelin container is running (it is named zeppelin if you followed my tutorials), stop it
docker stop zeppelin
Now create a new container pointing to the volumes of the previous version. The new container will automatically use the updated image
docker run --name zeppelin2 -p 8080:8080 --volumes-from zeppelin -d yorek/zeppelin-sqlserver:v0.7
List again all the existing containers (running and not running)
docker ps -a
Note that the old zeppelin container is still there, and if it’s not needed it can be removed
docker rm zeppelin
The new container can now be renamed to the usual name
docker rename zeppelin2 zeppelin
The old image can also be removed to free disk space
docker rmi yorek/zeppelin-sqlserver:v0.6.2
Now check that only one zeppelin container, with the version 0.7, is available:
Upgrade done! Now you can start Apache Zeppelin for SQL Server with the usual command:
docker start zeppelin
and you’ll have access to all your existing notebooks and interpreter bindings.
A common problem
Recently we had to setup a couple of contact form pages to allow users to register to the App 2.0 Beta program (now RTM) or to tell us which of the products we launched at CES 2017 they are interested into.
Such kinds of requests are quite frequent and usually, from a developer perspective, they are the worst thing someone can ask for. They are boring to architect, boring to write, boring to test. Still, business and marketing need and love them.
So, how do you satisfy this requirement in the fastest and easiest way, while still delivering high-quality, easy-to-maintain code and keeping an eye on integrating the data with the existing environment?
Evaluating Azure DocumentDB
Given that we are on Azure, one option was Azure DocumentDB. There is no need to define a schema upfront and it is usable via plain REST, so the UX developer could do everything on his own, just using jQuery. Azure DocumentDB can also be queried using a SQL-like language, so extracting data from it wouldn’t have been a problem.
But in the end, I decided to go another way. All our data, after swimming in a Data Lake, is saved in an Azure SQL database, where we heavily rely on its column-store features. So having an additional database to integrate would have made the solution a bit more complex than the simplest possible one. The famous quote
everything should be made as simple as possible, but no simpler
is what drives all my architectural decisions, so I wasn’t really happy about going with Azure DocumentDB.
Adding the fact that there is no really good tooling around Azure DocumentDB yet, I started to look for alternatives. The obvious alternative was to use Azure SQL; saving data into a blob was not an option, since that would have been *too* simple: a blob doesn’t offer any querying capability.
Moving to Azure SQL?
With Azure SQL you have great tooling (SQL Server Management Studio and now also the online query editor) and we already have all the knowledge in-house, but the fact that it doesn’t let you use just REST to read and write data was, again, something that didn’t make me really happy.
Besides that, Azure SQL seemed to be the perfect option. JSON is now natively supported, so there is no problem storing data without a strictly enforced schema.
Since we’re already using Azure SQL, we wouldn’t even have to spend an additional cent on it. The only problem to solve was that you can’t use Azure SQL directly via jQuery.
Serverless is the key
The missing link (the REST interface) can easily be created using Azure Functions and a micro-ORM like Dapper. Thanks to the serverless nature of Azure Functions, all that’s needed are the few lines of code to get the HTTP request body that contains the contact form’s “JSONified” data and store it into the Azure SQL database.
The created Azure Function gets called each time an HTTP request is made, using an HTTP trigger. Here’s the function.json file that defines the function bindings:
and the function full code is here:
Such an additional layer is also a welcome addition, since it allows injecting some additional validation checks and business logic if needed.
I’ve used a Stored Procedure here for better decoupling, and it does just a simple insert, with some additional checks.
It also turns some JSON data into columns, to make querying even easier.
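The actual procedure isn’t shown here, but the pattern is roughly the following; procedure, table and property names are all hypothetical:

```sql
-- Sketch: store the raw JSON and promote some properties to real columns
CREATE PROCEDURE dbo.InsertContactRequest
    @payload NVARCHAR(MAX)
AS
BEGIN
    -- Reject anything that is not valid JSON
    IF ISJSON(@payload) = 0
        THROW 50000, 'Invalid JSON payload', 1;

    INSERT INTO dbo.ContactRequests (Email, Product, RawPayload, ReceivedAt)
    SELECT
        JSON_VALUE(@payload, '$.email'),
        JSON_VALUE(@payload, '$.product'),
        @payload,
        SYSUTCDATETIME();
END
```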
Dapper helps make interacting with Azure SQL a breeze (two, yes, two lines of code to open the connection and write to Azure SQL), so all simplicity requirements are finally met. No more impedance mismatch. No more.
Given the existing external factors (resources, knowledge, integration) the simplest solution but not simpler has been achieved.
Without those existing factors I would probably have chosen Azure DocumentDB. But on second thought, I still have the gut feeling that Azure SQL would have been better in the long term (for example, I can connect the Marketing Manager’s Excel directly to Azure SQL, something that I cannot do yet with DocumentDB)… so maybe Azure SQL would be my choice once again. After all, software is made to serve people, and this should drive at least 50% (if not more) of the architectural choices, IMHO.
Since I joined Sensoria I’ve been working on the cloud 100% of my time. And the more I work on the cloud, the clearer it is to me that serverless and simplicity (which means, implementation-wise: microservices) are the way to go. Efficiency increases so much this way that it’s really worth the effort, and embracing PaaS clearly becomes an obvious choice.
Now that Apache Zeppelin has been downloaded and configured, it’s time to create a dashboard. It will be very easy: all you have to do is figure out which data you want to show, write the corresponding T-SQL query and then add some charts and information to make it perfect.
To create the first Apache Zeppelin dashboard, let’s use the new Wide World Imports sample database from Microsoft:
Take the “Standard” version so that you can use it also on a non-premium Azure SQL instance, if you want to try it on Azure.
Once you have restored the WideWorldImportersStandard database, run Apache Zeppelin 0.6.2 for SQL Server docker container:
docker run --name zeppelin -p 8080:8080 -v /zeppelin-sqlserver/notebook -v /zeppelin-sqlserver/conf -d yorek/zeppelin-sqlserver:v0.6.2
make sure you have configured an interpreter (call it “Azure”, for example) with the connection information to point to WideWorldImportersStandard, and then create a new note:
And now it’s just a matter of writing the queries and turning the results into charts. As usual, I’ve recorded a quick video (it’s just six-something minutes) to show how easily you can do that. (And I’ve also started from configuring the interpreter, so that you can review that part too.)
In the video I also show how the markdown interpreter can be used to add information to the dashboard.
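To give an idea of the kind of queries involved, here is one along the lines of those used in the dashboard, assuming the standard Wide World Importers schema:

```sql
-- Orders per month: an easy candidate for a bar chart in Zeppelin
SELECT
    YEAR(OrderDate)  AS order_year,
    MONTH(OrderDate) AS order_month,
    COUNT(*)         AS orders
FROM Sales.Orders
GROUP BY YEAR(OrderDate), MONTH(OrderDate)
ORDER BY order_year, order_month;
```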
The sample dashboard, that also includes the queries, can be downloaded here:
My First Dashboard.json
If you prefer to download only the queries and then DIY, here’s a SQL file with all the used queries:
My First Dashboard.sql
I really recommend you start using Apache Zeppelin if you haven’t yet. It’s incredibly useful, even just for DBAs to monitor SQL Server status. I’ll talk about this in a forthcoming post. Stay tuned!
In order to keep created notebooks and interpreter configurations when upgrading the Docker image to a newer version, I changed the dockerfile to use Docker volumes, as described here (if you’re interested in learning more):
Manage data in containers
to run the container, now, the command is
docker run --name zeppelin -p 8080:8080 -v /zeppelin-sqlserver/notebook -v /zeppelin-sqlserver/conf -d yorek/zeppelin-sqlserver:v0.6.2
The -v options do the trick and will be very useful the first time a new image is released: you’ll be able to keep all your notebooks without having to export them beforehand and, in addition, interpreter configurations will also be preserved.
The solution used until now (sharing a volume with the host) works nicely, but unfortunately only for notebooks. If you have a lot of different interpreters configured (like me), re-configuring them every time the image is updated is really time-consuming and boring.
To be sure that your container is using volumes, you can check it using the inspect command:
docker inspect zeppelin
The output is a JSON with detailed information on the container. Look for the Mounts node:
If you are using a previous version of the image, my recommendation is to download this updated one so that you’ll be ready for future updates.
If you’re not using Apache Zeppelin yet, you really should start. No matter if you are a Data Scientist or a DBA, Zeppelin is *really* useful for creating nice visualizations and dashboards just using T-SQL:
After a little bit of work I’ve managed to have syntax highlight for T-SQL in Apache Zeppelin 0.6.2 working correctly.
The tricky part is that the Ace Editor has supported T-SQL-specific syntax highlighting since v1.2.0, but Apache Zeppelin is still using v1.1.9.
Moving to v1.2.0 doesn’t work, since it creates some compatibility issues, so the only way to go was to manually patch and tweak Apache Zeppelin to use the T-SQL highlighting file available in version 1.2.0.
Said and done, now T-SQL is beautifully highlighted:
SQL Server 2016 and vNext keywords aren’t supported yet, but I’ll work on this in the future for sure.
Both the GitHub repository and the Docker Hub are already updated. To update your docker image, if you already have downloaded it before, just do the usual pull:
docker pull yorek/zeppelin-sqlserver:v0.6.2
Remember to export your notebooks first, to be sure not to lose any work done so far (this behavior will be fixed soon). Then go for
docker stop zeppelin
docker rm zeppelin
docker run -p 8080:8080 --name zeppelin -d yorek/zeppelin-sqlserver:v0.6.2
In the previous post I showed how to download and run Apache Zeppelin 0.6.2 for SQL Server, using Docker to simplify the installation procedure. It’s now time to start using the fantastic Apache Zeppelin to connect to SQL Server and run some queries.
This version, specifically created for SQL Server, already comes with a predefined and almost fully configured interpreter, so you just have to specify the SQL Server (or Azure SQL or Azure DW) instance you want to connect to, the database, the username and the password, and you’re good to go.
Apache Zeppelin Interpreter
An Apache Zeppelin interpreter is a library that allows Apache Zeppelin to connect to and use a specific database, data provider or even language.
Notebooks and Paragraphs
A notebook is made of one or more paragraphs, which are the containers where you actually write and run code. To specify which interpreter a paragraph should use, you just have to write the interpreter name prefixed by the percent (%) character:
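For example, a paragraph bound to the T-SQL interpreter of this build looks like this (the query itself is just a placeholder):

```sql
%tsql
SELECT TOP 10 name
FROM sys.databases;
```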
Each notebook can use one or more interpreters. The process of defining which interpreters a notebook can use is called binding. The interpreters shipped with Apache Zeppelin 0.6.2 for SQL Server are:
To specify which interpreters are available in the active notebook, you can just click on the little gear icon in the top right:
Apache Zeppelin 0.6.2 for SQL Server comes with the following interpreters already bound to each notebook by default:
The default interpreter will be used if you don’t specify which interpreter a paragraph should use. As you can see, you can mix different interpreters in a notebook, so that you can do almost everything you want. The Markdown one is very useful, since with it you can create self-explanatory notebooks.
Before using an interpreter you have to configure it. This can be done using the “Interpreter” menu accessible by clicking on the drop-down arrow visible on the right of the top search bar.
Click on Interpreter menu item to enter configuration page
Each interpreter has its own set of configuration options. For T-SQL you have to specify the following options:
Besides the usual username, password and initial database, you have a couple more options:
The JDBC driver to use. Just leave the default, unless you want to use, for any reason, another JDBC driver to connect to SQL Server, which is very unlikely. I’ll probably hide this option in the future, now that Microsoft has officially released and open-sourced the JDBC driver for SQL Server, which is the JDBC driver to use anyway.
This is the “address” of the SQL Server instance you want to connect to, using the JDBC format:
more details on the available properties can be found here: Building the Connection URL.
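For example, a typical URL for an Azure SQL database looks like this (server and database names are placeholders):

```
jdbc:sqlserver://myserver.database.windows.net:1433;databaseName=WideWorldImportersStandard
```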
Big Data means that you cannot work by looking at all your data: it is just too big (by definition!). So you’ll normally want to avoid returning all of your rows if, by mistake, you forget to specify a WHERE clause. Here you specify the maximum number of rows you ever want returned. 1000 is the default and should be more than enough. If you want to change it, keep in mind that all downloaded data will live in your browser’s memory space, so you don’t really want to push that number too high.
This is a very tricky and interesting property: it allows you to decide whether all paragraphs in the notebook will share the same connection, or each paragraph will have a dedicated connection of its own.
With a notebook connection you will be able to share #temp tables between paragraphs, for example, but you won’t be able to execute two or more paragraphs in parallel: each paragraph will, in fact, be executed using a FIFO logic.
By using a paragraph connection style, on the opposite, each paragraph will be totally independent and isolated from the others (which means it will have its own spid), just like what happens when you use more than one document with SQL Server Management Studio. This connection style allows each paragraph to be executed in parallel with the others.
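A quick sketch of the difference, with made-up data: with the notebook connection style, something like this works across two paragraphs, while with the paragraph style the second one would fail because #temp tables are per-connection.

```sql
-- Paragraph 1: create and fill a temporary table
SELECT name, create_date
INTO #recent_dbs
FROM sys.databases
WHERE create_date > '2016-01-01';

-- Paragraph 2: with a shared (notebook) connection the #temp table
-- is still visible here; with a paragraph connection it would not be
SELECT COUNT(*) AS recent_dbs
FROM #recent_dbs;
```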
Ready, set, go!
Once the settings are saved, you can start to query your SQL Server instance:
T-SQL query executed in Apache Zeppelin 0.6.2
See it in action
As I did the last time, I’ve also created a quick video to show, in less than 5 minutes, how you can configure Apache Zeppelin 0.6.2 for SQL Server and run your first queries:
As promised in my last post, here is the first post that aims to show how to use Apache Zeppelin with SQL Server. First things first: installation. Or maybe not.
The first thing to do, usually, is install the product. Since we’re talking Java here, things may get a little bit long and complex if, like me, you’re more of a .NET guy. Even worse if you’re neither a Java nor a .NET guy, and you’re just a DBA or a Data Scientist. Well, Docker is here to help.
Download and install Docker. It’s very easy and takes only a few minutes.
Once it is running, open a console where you can send docker commands (any terminal if on Linux or macOS, PowerShell if running on Windows, Docker Quickstart Terminal if running using the Docker Machine toolbox) and go for the following commands:
docker pull yorek/zeppelin-sqlserver:0.6.2
docker run -p 8080:8080 --name zeppelin -d yorek/zeppelin-sqlserver:0.6.2
The first command downloads the Docker image (depending on your connection speed it may take a while) and the second runs the Docker container with Apache Zeppelin inside. It also exposes port 8080, so that it can be used to reach the contained Apache Zeppelin process.
That’s it. Now you can connect to your local machine and start using Apache Zeppelin:
If you’re still using the “old” Docker Machine (maybe because, like me, you also need to use VMWare and therefore cannot install Hyper-V), you have to connect to your Docker Machine instead of localhost. To get the IP address of your Docker Machine, simply run
docker-machine ip
from the Docker Quickstart Terminal.
To view the entire process in just a minute, here’s a short video:
Next stop: Configure Apache Zeppelin and run your first query against SQL Server.
I’ve updated the code-base to Apache Zeppelin 0.6.2 and I’ve also finished a first simple-but-working support for autocomplete (you can activate it using CTRL + .). Right now the autocomplete is based on the keywords specified here:
Reserved Keywords (Transact-SQL)
It’s not much, I know, but it’s something, at least. Next steps will be to read schemas, tables and columns from the SQL Server catalog tables. And maybe extract the list of keywords from… somewhere else, to have more complete coverage.
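The catalog query behind that next step would be something along these lines:

```sql
-- Schemas, tables and columns, ready to feed an autocomplete list
SELECT
    s.name AS schema_name,
    t.name AS table_name,
    c.name AS column_name
FROM sys.tables AS t
JOIN sys.schemas AS s ON t.schema_id = s.schema_id
JOIN sys.columns AS c ON c.object_id = t.object_id
ORDER BY s.name, t.name, c.column_id;
```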
I’ve also removed the additional interpreters that may not be useful if you just plan to use it against T-SQL/TDS-compatible engines (SQL Server, Azure SQL and Azure DW), and configured the defaults so that it is ready to use SQL Server right from the beginning.
The code, along with compilation, installation and basic usage instructions, is available on GitHub:
Apache Zeppelin 0.6.2 for SQL Server
Right now I’ve tested it only on Ubuntu Linux 16.04 LTS 64-bit. It should also work natively on Windows, but since I haven’t tried it on that platform yet, I don’t know the challenges you may face in getting the full stack (Java, Maven, Node, etc.) working in order to compile and run it.
At the beginning of next week I’ll release a small tutorial showing how you can use Apache Zeppelin for SQL Server on your Windows machine too, using Docker. I plan to do a few tutorials on the subject, since I find Apache Zeppelin very useful and I’m sure it will be loved by many other SQL Server guys once they start to play with it.
At some point I’ll also release the binary package alone, so that one doesn’t have to compile it oneself (but hey, we love Linux right now, don’t we?) and it can just run on Windows. For now, though, I find the Docker container approach so much better than anything else (it “just runs” and I can manage everything via GitHub and Docker Hub) that I’ll stay with this for a while.