Using IIF to optimize Filter in MDX queries

2012-07-19 12:03:25 | Category: Data Mining


The basic idea: using Filter prevents block computation mode from being used, while replacing Filter with IIF greatly reduces the number of cells that have to be examined.
Take the following MDX query:
WITH
  MEMBER [Measures].AvgGrowingProducts AS
    Avg
    (
      Filter
      (
        [Product].[Product].[Product].MEMBERS
       ,[Measures].[Sales Amount] > ([Measures].[Sales Amount],ParallelPeriod([Date].[Calendar].[Month]))
      )
     ,[Measures].[Sales Amount]
    )
SELECT
  [Measures].AvgGrowingProducts ON 0
 ,Descendants
  (
    [Date].[Calendar].[Calendar Year].&[2003]
   ,[Date].[Calendar].[Date]
  ) ON 1
FROM [Adventure Works];
It examines 295011 cells and takes 7 seconds. Rewritten with IIF:
 
WITH
  MEMBER [Measures].Growth AS
    IIF
    (
      [Measures].[Sales Amount] > ([Measures].[Sales Amount],ParallelPeriod([Date].[Calendar].[Month]))
     ,[Measures].[Sales Amount]
     ,NULL
    )
   ,FORMAT_STRING = 'Currency'
  MEMBER [Measures].AvgGrowingProducts AS
    Avg
    (
      [Product].[Product].[Product].MEMBERS
     ,[Measures].Growth
    )
SELECT
  [Measures].AvgGrowingProducts ON 0
 ,Descendants
  (
    [Date].[Calendar].[Calendar Year].&[2003]
   ,[Date].[Calendar].[Date]
  ) ON 1
FROM [Adventure Works];
it only needs to aggregate 365 cells and takes 200 milliseconds.


The original article is here: http://sqlblog.com/blogs/mosha/archive/2008/10/22/optimizing-mdx-aggregation-functions.aspx

It is such a good article that I am reposting it in full for future reference:

Optimizing Count(Filter(...)) expressions in MDX

As the readers of my blog know, bulk evaluation mode (called "block computation mode" in Katmai) delivers much better performance in MDX than the cell-by-cell evaluation mode. Therefore the most important optimization technique for MDX in Analysis Services is to rewrite MDX in a way that makes block computations possible. Easy to say, but not always easy to do. With the release of Katmai's CTP5, Microsoft published a BOL article outlining the conditions under which block computations are and are not possible. Chris Webb picked up on this article and mentioned in his blog that "the list of set functions is a bit limited (where is Filter?)". In reply I said that I didn't think the Filter function was that common inside MDX calculations, but Greg Galloway immediately came up with a good example, one that involves Count(Filter(...)). This is indeed a common calculation - every time we want to know how many entities there are which satisfy a certain condition. (Greg's example was to find out how many physicians performed 10 or more cases during a certain time period.)

Let's build an example using the Adventure Works sample cube and see how we can optimize it. In Adventure Works terms, our task would be to find the number of products which had more than 5 orders placed over the Internet. Such a calculation can be written as

Count(Filter([Product].[Product].[Product], [Measures].[Internet Order Quantity] > 5))

Indeed, the Filter function is not optimized to work in block computation mode in either AS2005 or AS2008, therefore a query involving this calculation will execute in cell-by-cell mode:

WITH MEMBER [Measures].[High Volume Products Count] AS 
 Count(Filter([Product].[Product].[Product], [Measures].[Internet Order Quantity] > 5))
SELECT [Customer].[Customer Geography].[Country] ON 0
, [Date].[Calendar].[Date].MEMBERS ON 1
FROM [Adventure Works]
WHERE [Measures].[High Volume Products Count]

If we execute this query in MDX Studio, we will get the following stats:

Time             : 30 sec 781 ms
Calc covers      : 4
Cells calculated : 4217436
Sonar subcubes   : 2
SE queries       : 1
Cache hits       : 1
Cache misses     : 1
Cache inserts    : 1
Cache lookups    : 2
Memory Usage KB  : 4160

The best hint that this query indeed executed in cell-by-cell mode (beyond the slow execution time) is the value of the "Cells calculated" perfmon counter. Now, the way most people approach optimization of such MDX is to try to reduce the number of cells to iterate over. One way to do it is to manually eliminate all the empty cells from the Filter. I.e., if the value of [Internet Order Quantity] is NULL for a certain product, it is definitely less than 5. With this in mind, one possible rewrite of the query would be

WITH MEMBER [Measures].[High Volume Products Count] AS
 Count(Filter(
  Exists([Product].[Product].[Product],,"Internet Sales")
  ,[Measures].[Internet Order Quantity] > 5))
SELECT [Customer].[Customer Geography].[Country] ON 0
, [Date].[Calendar].[Date].MEMBERS ON 1
FROM [Adventure Works]
WHERE [Measures].[High Volume Products Count]

Here the Exists inside Filter eliminates the empty space. Let's run this again in MDX Studio and look at the new stats:

Time             : 12 sec 46 ms
Calc covers      : 954
Cells calculated : 50036
Sonar subcubes   : 7900
SE queries       : 7899
Cache hits       : 7899
Cache misses     : 1
Cache inserts    : 1
Cache lookups    : 7900
Memory Usage KB  : 43936

The time indeed decreased, from 30 seconds to 12 seconds. This is good, but something doesn't add up. The number of cells went down from 4 million to only 50 thousand, i.e. about 85 times, yet the execution time decreased by only 2 times. The explanation is that putting Exists (or NonEmpty) inside an MDX calculation is usually a bad idea - since now, for every cell, there is an SE query being sent (it can be seen through the "Query Subcube" event in the trace too). In our case there were 7900 such queries sent. Because the set inside Exists was always the same and the context was fixed, only 1 such query actually had to go to disk, and the other 7899 hit the cache, yet the overhead of issuing an SE query is non-trivial, even when it is answered from the cache.

The conclusion is that even though we get some performance gain from this approach, it doesn't play well with the rest of the system, and only drives us further from the goal of switching to the superior block computation mode. We need to rewrite the calculation in such a way that we eliminate Filter. Fortunately, this is possible. Let's recall that the Count function returns the number of tuples in a set, and Filter returns the set of tuples which satisfy a certain condition. I.e. we are counting how many tuples satisfy a condition. If we convert the count to a sum, and sum up 1's every time the condition is met and 0's every time the condition is not met, we will get the same result. I.e.

Count(Filter(set, condition)) = Sum(set, Iif(condition, 1, 0))

Using this formula we can rewrite our calculation as follows:

WITH MEMBER [Measures].[High Volume Products Count] AS
 Sum(
  [Product].[Product].[Product],
  Iif([Measures].[Internet Order Quantity] > 5,1,0))
SELECT [Customer].[Customer Geography].[Country] ON 0
, [Date].[Calendar].[Date].MEMBERS ON 1
FROM [Adventure Works]
WHERE [Measures].[High Volume Products Count]

Executing this query yields the following stats:

Time             : 6 sec 375 ms
Calc covers      : 4
Cells calculated : 6948
Sonar subcubes   : 1
SE queries       : 1
Cache hits       : 1
Cache misses     : 1
Cache inserts    : 1
Cache lookups    : 2
Memory Usage KB  : 0

This is definitely much better. The time is down to 6 seconds, and we can tell that within every cell the Sum works very efficiently, because we only see 6948 cells calculated (this is the exact number of cells in the resulting cellset). Yet we are still not in true block computation mode. We are still running Sum for every cell, instead of computing the entire query in one operation. What prevents us from doing it now? The problem now is the Iif function inside the Sum. I have written about the Iif function and its interaction with block mode in the past. If we reread that article, we will see that we are in the scenario where the condition inside Iif looks at cell values and not at attribute coordinates, and the only thing we can do here is to have one of the branches return NULL. And it probably makes sense for our calculated member to return NULL instead of 0 when there are no products which satisfy the condition. After this rewrite we get

WITH MEMBER [Measures].[High Volume Products Count] AS
 Sum(
  [Product].[Product].[Product],
  Iif([Measures].[Internet Order Quantity] > 5,1,NULL))
SELECT [Customer].[Customer Geography].[Country] ON 0
, [Date].[Calendar].[Date].MEMBERS ON 1
FROM [Adventure Works]
WHERE [Measures].[High Volume Products Count]

This gets us down to 4 seconds, which is yet another improvement, but it still doesn't reach the best execution plan. Now it is time to use performance optimization hints. Looking at the expression Iif([Measures].[Internet Order Quantity] > 5,1,NULL), we observe that it is guaranteed to be NULL whenever [Measures].[Internet Order Quantity] is NULL. Therefore we can define the MDX Script flavor of NON_EMPTY_BEHAVIOR for it. Since NEB cannot be defined on subexpressions, we will separate it into a special calculated measure. Let's write the following fragment inside the MDX Script:

CREATE HIDDEN Summator;
[Measures].[Summator] = Iif([Measures].[Internet Order Quantity] > 5,1,NULL);
NON_EMPTY_BEHAVIOR([Measures].[Summator]) = [Measures].[Internet Order Quantity];

And then use it inside our calculation:

WITH
 MEMBER [Measures].[High Volume Products Count] AS
  Sum([Product].[Product].[Product], [Measures].[Summator])
SELECT [Customer].[Customer Geography].[Country] ON 0
, [Date].[Calendar].[Date].MEMBERS ON 1
FROM [Adventure Works]
WHERE [Measures].[High Volume Products Count]

Finally, when we execute this query, it finishes in about 0.2 seconds (218 milliseconds). This is the performance we expect to observe from block computation mode. From 32 seconds to 0.2 seconds - an improvement of over 160 times!

Now the even better news is that starting with Katmai's November CTP5 (which was released a couple of days ago, so I can finally speak publicly about it), the trick of explicitly defining NON_EMPTY_BEHAVIOR is not required - the engine recognizes it by itself. Therefore, even our previous query (the one which took 4 seconds in AS2005) returns in 0.2 seconds in AS2008. (This improvement in the Katmai query optimizer is just a tiny tip of the iceberg of improvements in block computation query plans, and I intend to cover this subject in more depth in upcoming blogs, if I have enough time for it.)


Optimizing MDX aggregation functions

One of the most significant changes in Analysis Services 2008 was improved performance of MDX queries and calculations. In particular, the query optimizer can choose block (a.k.a. subspace) computation mode in the query plan more often than in AS2005, and using block/subspace computation mode usually brings performance orders of magnitude better than without it. However, even in AS2008 the query optimizer is not always capable of using block mode. The "Performance Improvements for MDX in SQL Server 2008 Analysis Services" whitepaper documents which MDX constructs are optimized and which are not, and MDX Studio automates the process of analyzing MDX, figuring out which fragments are not optimized, and offering advice on how to optimize them. In this article we will cover techniques related to the optimization of aggregation functions in MDX, i.e. Sum, Min, Max, Aggregate and (in AS2008) Avg. All examples were done on the AS2008 version of Adventure Works (some of these techniques will work with AS2005, but others won't).

1. Aggregating over Filter

Scenario: We want to compute the average sales for the products which increased their sales since the same date last month. The normal approach is to use the Filter function to determine which products grew their sales, and then apply Avg to the resulting set. Applying this to every day in the year 2003, we get

WITH
  MEMBER [Measures].AvgGrowingProducts AS
    Avg
    (
      Filter
      (
        [Product].[Product].[Product].MEMBERS
       ,[Measures].[Sales Amount] > ([Measures].[Sales Amount],ParallelPeriod([Date].[Calendar].[Month]))
      )
     ,[Measures].[Sales Amount]
    )
SELECT
  [Measures].AvgGrowingProducts ON 0
 ,Descendants
  (
    [Date].[Calendar].[Calendar Year].&[2003]
   ,[Date].[Calendar].[Date]
  ) ON 1
FROM [Adventure Works];

Running this query took more than 7 seconds on my laptop, and the culprit is clear from the perfmon counters – there were 295011 cells calculated – a sign of cell-by-cell iteration as opposed to working in block mode. And if we run the "Analyze" function of MDX Studio, it will tell us why this is happening: Function 'Filter' was used inside an aggregation function – this disables block mode. How do we get rid of Filter? I have actually written about this before – in "Optimizing Count(Filter(...)) expressions in MDX", with respect to the Count function – but the same technique can be applied to any aggregation function, including Avg (in AS2008). The idea is simple – even though Filter reduces the set of products, in order to compute the Filter in the first place we really have to check the sales of every single product, so we may as well just run Avg over all products, but with a smart expression which returns NULL in case a product should not be included – and since Avg ignores NULLs, this works correctly. Here is how the rewritten MDX looks:

WITH
  MEMBER [Measures].Growth AS
    IIF
    (
      [Measures].[Sales Amount] > ([Measures].[Sales Amount],ParallelPeriod([Date].[Calendar].[Month]))
     ,[Measures].[Sales Amount]
     ,NULL
    )
   ,FORMAT_STRING = 'Currency'
  MEMBER [Measures].AvgGrowingProducts AS
    Avg
    (
      [Product].[Product].[Product].MEMBERS
     ,[Measures].Growth
    )
SELECT
  [Measures].AvgGrowingProducts ON 0
 ,Descendants
  (
    [Date].[Calendar].[Calendar Year].&[2003]
   ,[Date].[Calendar].[Date]
  ) ON 1
FROM [Adventure Works];

Now it executes in a mere 200 ms, and the number of cells calculated is 365 – exactly the number of cells in the result – which is the best theoretical value we can get!

2. Aggregating over NonEmpty

Scenario: Compute average sales for all products. The solution is straightforward:

WITH
  MEMBER [Measures].AvgProductSales AS
    Avg
    (
      [Product].[Product].[Product].MEMBERS
     ,[Measures].[Sales Amount]
    )
SELECT
  [Measures].AvgProductSales ON 0
 ,[Date].[Date].[Date].MEMBERS ON 1
FROM [Adventure Works];

It is very efficient – only 100 ms and a single SE query. But what if someone wants to optimize it even more? Is this possible? Poking around with the MDX Studio expression debugger, we notice that the sets over which Avg is run are pretty sparse. For example, on August 7, 2001 there were only 3 products which had sales; the rest were empty. A year later, on August 7, 2002, there were only 5 products which sold. So it might be tempting to add the NonEmpty function over the products, in the hope of reducing the size of the set before it is fed to Avg:

WITH
  MEMBER [Measures].AvgProductSales AS
    Avg
    (
      NonEmpty
      (
        [Product].[Product].[Product].MEMBERS
       ,[Measures].[Sales Amount]
      )
     ,[Measures].[Sales Amount]
    )
SELECT
  [Measures].AvgProductSales ON 0
 ,[Date].[Date].[Date].MEMBERS ON 1
FROM [Adventure Works];

This turns out to be a big mistake. The execution time jumps 18-fold to 1 sec 796 ms, and all the perfmon counters jump as well, with the number of SE queries going from 1 to 1189. I have written about this before, but it is worth reiterating. Unfortunately, this kind of wrong advice – "reduce the space with NonEmpty before applying the computation" – is a popular myth. I see it mentioned in various presentations, and it even shows up in the otherwise good "MDX Performance Hints" document. The irony is that the tip which is supposed to improve performance actually makes it worse. The reason is twofold:

1) Since NonEmpty is used inside a calculated member, it will be computed every time this calculated member is invoked. No block mode. Each call to NonEmpty triggers at least one SE query. This is why we see so many SE queries in perfmon – one for every cell which uses the AvgProductSales calculated member.

2) The result of NonEmpty could be a set of any shape and form. It is very difficult for Avg to work with such unpredictable sets. In our example we got lucky: since we only had a single hierarchy inside NonEmpty, we didn't end up with an arbitrary-shape set – otherwise performance would've been even worse.

So the guidance is to never use NonEmpty inside calculations, but it is OK to use NonEmpty while building axes or named sets – because then it is only called once, not for every cell.
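As a minimal illustration of the safe pattern (my addition, not from the original post), the following query evaluates NonEmpty exactly once, while the ROWS axis is built, so no per-cell SE queries are triggered:

SELECT
  [Measures].[Sales Amount] ON 0
 ,NonEmpty
  (
    [Product].[Product].[Product].MEMBERS
   ,[Measures].[Sales Amount]
  ) ON 1
FROM [Adventure Works];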

3. Aggregating over Union

Scenario: Running sum by day. A typical and very sensible approach to this problem is to call Sum over all the preceding days. Assume that our calculation will only ever be called on the Date attribute (this is easy to ensure with a SCOPE assignment – a sketch of that form follows the query below – but for simplicity I will put it here as a regular calculated member):

WITH
  MEMBER [Measures].RunSales AS
    Sum
    (
      NULL : [Date].[Date].CurrentMember
     ,[Measures].[Sales Amount]
    )
SELECT
  [Measures].RunSales ON 0
 ,[Date].[Date].[Date].MEMBERS ON 1
FROM [Adventure Works];
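As an aside, here is a minimal sketch (my addition, not from the original post) of the SCOPE assignment form mentioned above. It assumes the fragment lives in the cube's MDX Script, and the member name RunSales is purely illustrative:

CREATE MEMBER CURRENTCUBE.[Measures].RunSales AS NULL;
// Assign the running-sum expression only at the Date grain,
// so the calculation is never invoked at any other granularity.
SCOPE([Measures].RunSales, [Date].[Date].[Date].MEMBERS);
  THIS = Sum(NULL : [Date].[Date].CurrentMember, [Measures].[Sales Amount]);
END SCOPE;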

The calculated-member query works OK – about 2 seconds execution time, and in block mode. We still want to optimize it. The common optimization technique for a running sum is to do the summation at higher grains whenever possible. I.e., instead of running over days, we would like to run over months up to the current month, and only from there over the remaining days in this month. This approach makes a lot of sense, but the implementation should be done carefully. Let's see what happens if we modify the set to be a union of months and days:

WITH
  MEMBER [Measures].RunSales AS
    Sum
    (
      Union
      (
        NULL : [Date].[Calendar].Parent.PrevMember
       ,
        [Date].[Calendar].FirstSibling : [Date].[Calendar].CurrentMember
      )
     ,[Measures].[Sales Amount]
    )
SELECT
  [Measures].RunSales ON 0
 ,[Date].[Date].[Date].MEMBERS ON 1
FROM [Adventure Works];

The results are disastrous! The execution time rose above 3 seconds, but more alarmingly, both the number of cells calculated and the number of SE queries jumped into the thousands. Why is that? After all, the official documentation says that Sum over Union is in fact optimized for block mode, but the results we see here clearly show the opposite. The root cause lies a little deeper. It is true that Sum(Union()) is optimized, but one of the other conditions for aggregation functions to work optimally is to have the input set in a good shape, i.e. not an arbitrary-shaped set. Our Union combines sets from different granularities, so we end up with a mixed-grain set, which is classified as arbitrary shape. How can we fix it? One way is to break the single Sum into two Sums, each one over a set of a uniform single grain, i.e.

WITH
  MEMBER [Measures].RunSales AS
      Sum
      (
        NULL : [Date].[Calendar].Parent.PrevMember
       ,[Measures].[Sales Amount]
      )
    +
      Sum
      (
        [Date].[Calendar].FirstSibling : [Date].[Calendar].CurrentMember
       ,[Measures].[Sales Amount]
      )
SELECT
  [Measures].RunSales ON 0
 ,[Date].[Date].[Date].MEMBERS ON 1
FROM [Adventure Works];

Now the results are great – the runtime is about 400 ms, and the other stats look good too. It is possible to optimize even further, using techniques described in the "Take advantage of FE caching to optimize MDX performance" article. We notice that the first operand of the plus is the same for all days in the month, so we can take advantage of that by caching it at the first day and reusing it for all the other days, as follows:

WITH
  MEMBER [Measures].RunMonthSales AS
    Sum
    (
      NULL : [Date].[Calendar].CurrentMember
     ,[Measures].[Sales Amount]
    )
  MEMBER [Measures].RunSales AS
      ([Measures].RunMonthSales, [Date].[Calendar].Parent.PrevMember)
    +
      Sum
      (
        [Date].[Calendar].FirstSibling : [Date].[Calendar].CurrentMember
       ,[Measures].[Sales Amount]
      )
SELECT
  [Measures].RunSales ON 0
 ,[Date].[Date].[Date].MEMBERS ON 1
FROM [Adventure Works];

This gives a better runtime of 280 ms; on bigger cubes the difference will be more significant.

4. Aggregating over Exists or CrossJoin?

To be written when I get more time.

5. Aggregating over Descendants or EXISTING?

To be written when I get more time.



This rule about named sets is also worth noting: it seems to say that a named set cannot be used as the first parameter of Sum (and similar functions), though it doesn't explain why.

Usage of Named Sets or Set Alias in parameters

Any time a named set or a set alias is used as the first parameter of the functions Sum, Min, Max, Avg, or Aggregate in your MDX code, your code will not benefit from the performance improvements.

For example, the following MDX expression counts how many members have more than one child.

Sum(h.members as S, Iif(S.Current.Children.Count > 1, 1, 0))

Because h.members is aliased as S, and the Current function's value is later taken from the aliased set, the expected performance improvement cannot be obtained.
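One possible rewrite, a hedged sketch of mine rather than something from the whitepaper, is to drop the alias and address the hierarchy's current member directly, also returning NULL instead of 0 as recommended earlier in this post (h stands for the same placeholder hierarchy as in the whitepaper's example):

// h is a placeholder hierarchy, as in the whitepaper's example
Sum
(
  h.Members
 ,Iif(h.CurrentMember.Children.Count > 1, 1, NULL)
)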

Another common example of this situation is illustrated in the following code.

WITH
SET [Core Products] AS '{[Product].[Category].[Bikes]}'
MEMBER [Measures].[Core Products Sales] AS
 SUM([Core Products], [Measures].[Internet Average Unit Price] * [Measures].[Internet Order Quantity])
SELECT [Measures].[Core Products Sales] ON 0
FROM [Adventure Works]

The SUM function in the member definition does not obtain the expected performance improvement because its first parameter is a named set.
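A possible workaround, again my sketch rather than the whitepaper's, is to pass the set expression directly instead of referencing the named set:

WITH
MEMBER [Measures].[Core Products Sales] AS
 // the set expression is inlined, so no named set appears as the first parameter
 SUM({[Product].[Category].[Bikes]}, [Measures].[Internet Average Unit Price] * [Measures].[Internet Order Quantity])
SELECT [Measures].[Core Products Sales] ON 0
FROM [Adventure Works]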
