****************************** BASIC APPROACH *******************************
=============================================================================
* To make fast graphs we reduce the actual number of points for processing.
Basically, if we need to display a 600x600 picture, we need no more than 600
points. To get these points from the (possibly) huge amount of real data we
use a two-step algorithm.

First Step
----------
* At the first step we divide the actual time window (the whole interval we
are interested in) into N intervals and aggregate the data over them. For
each interval we know (a possible table layout is sketched after this list):
 - Interval start
 - Interval width
 - Number of data points within the interval
 - Aggregated values for all channels (MIN, MAX, MEAN)
 - In future versions other aggregation types (e.g. MEDIAN) could be added
 - Optionally we could have information about the maximal data gap within the
   interval (please consider the "DATA GAPS" section for information about
   precision)

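As an illustration, an aggregating cache table for one interval size could
look roughly like the following (a minimal sketch only; the table name and
column types are assumptions, only the stored quantities follow the list
above and the 'missing' attribute described in the "DATA GAPS" section):

    <?php
    // Hypothetical layout of one aggregating cache table; the interval
    // width (here 60 s) is assumed to be encoded in the table name.
    $create_cache_level_sql = "
        CREATE TABLE cache60__example_group__0 (
            `time`    DATETIME NOT NULL,  -- interval start
            `items`   INT      NOT NULL,  -- number of data points within the interval
            `missing` DOUBLE,             -- maximal data gap within the interval, seconds
            `min0`    DOUBLE,             -- aggregated values for channel 0
            `max0`    DOUBLE,
            `mean0`   DOUBLE,
            PRIMARY KEY (`time`)
        )";
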
* Of course, the database requests obtaining this information over big
intervals would take a lot of time. Therefore, we have CACHE tables
containing aggregated information for a few standard interval sizes
(configurable depending on the expected data rate).
When a data request is coming, we are (see the sketch after this list):
 1) Checking if the required intervals are quite close to one of the CACHE
    tables and returning information from the appropriate CACHE table if so.
 2) Otherwise, aggregating the data from the biggest possible CACHE which
    would still provide enough data points (when aggregated).
 3) For small intervals, the aggregation over the raw data table is executed.
    However, it does not take much time on a small interval size.

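A minimal sketch of such a cache-level selection, assuming a configured list
of cache interval sizes (in seconds), the requested window and the maximal
number of points; the function and variable names are illustrative, not the
actual ADEI API:

    <?php
    // Illustrative only: pick the coarsest cache level that still yields
    // at least $limit aggregated points for the requested window.
    function select_cache_level(array $cache_sizes, float $window, int $limit)
    {
        $needed = $window / $limit;      // ideal aggregation interval
        rsort($cache_sizes);             // try the coarsest levels first
        foreach ($cache_sizes as $size) {
            if ($size <= $needed)
                return $size;            // biggest cache still giving enough points
        }
        return null;                     // window too small: aggregate over raw data
    }

    // Example: levels of 1 day, 1 hour and 1 minute; a one-week window
    // rendered with at most 600 points is served from the 60-second cache.
    $level = select_cache_level(array(86400, 3600, 60), 7 * 86400, 600);
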
* The only problem is the case when we have a few small periods with a very
high update rate, while the rest of the data is sampled with a much lower
frequency. In that case it makes no sense (and would take a lot of disk
space) to make a CACHE table for aggregation. But the amount of data in these
specific places is too much for in-place aggregation and, therefore, these
places would be displayed with a big delay.
To handle such situations, we are using so called subcaches. [ WILL BE
INTRODUCED in the next version ]

Second Step
-----------
* At the second step we convert the intervals into data points. Actually, we
use several approaches to get different views of the data: just a smooth
averaged graph to get an impression of the global processes, the MIN-MAX
approach representing all extreme values of the data channels, etc.


MINMAX APPROACH
===============
N + 1 points are used for representing N intervals.
* Points are set at the beginning, at the end and between all intervals.
* For placing Y the following MIN-MAX-MIN algorithm is used (a small sketch
is given after this section). Consider two intervals: (min1,max1),
(min2,max2). The following three points (a,b,c) will be used.
 1. starting from the minimum: a = min1
 2. looking if the second interval is completely above us
    a) if so, putting b = min2
       - location: beginning of the second interval
       - on the next iteration we will continue to rise (at least to max2)
    b) otherwise, b is MAX(max1, max2)
       - location: end of ivl1 or beginning of ivl2
       - on the next iteration we will descend
 3. c = max2 or min2 depending on the previous step
* All points are processed in one chunk and, therefore, we would have only
one extra point compared to the number of intervals.

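To make the MIN-MAX-MIN placement more concrete, here is a small sketch of
the idea applied to a whole row of intervals (a simplified illustration, not
the actual ADEI code; the interval array format is an assumption):

    <?php
    // Each interval: array('start' => t, 'end' => t, 'min' => v, 'max' => v).
    // N intervals produce N + 1 points, one per interval boundary.
    function minmax_points(array $ivls)
    {
        $n = count($ivls);
        if ($n == 0) return array();

        // First point: start from the minimum of the first interval.
        $points = array(array($ivls[0]['start'], $ivls[0]['min']));
        $rising = true;                 // the last emitted point was a minimum

        for ($i = 0; $i + 1 < $n; $i++) {
            $cur  = $ivls[$i];
            $next = $ivls[$i + 1];
            if ($next['min'] > $cur['max']) {
                // The next interval lies completely above us: put a minimum
                // at its beginning and keep rising (at least to its maximum).
                $points[] = array($next['start'], $next['min']);
                $rising = true;
            } else {
                // Otherwise place the larger of the two maxima between the
                // intervals; the curve will descend on the next step.
                $points[] = array($next['start'], max($cur['max'], $next['max']));
                $rising = false;
            }
        }

        // Closing point at the end of the last interval: its maximum or its
        // minimum, depending on whether we were rising or descending.
        $last = $ivls[$n - 1];
        $points[] = array($last['end'], $rising ? $last['max'] : $last['min']);
        return $points;
    }
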
Handling Gaps
-------------
* The described procedure should be interrupted if we get a gap in the data.
Otherwise we would get serious precision problems. However, for each break we
would have an extra point which would break the specified limit of points. In
order to handle that problem, we ignore small gaps and restart the algorithm
only if there is enough space in the gap for another point, i.e. we normally
do not put points in the gap area. Therefore, if we would expect a point to
be put there (were it not a gap), we consider that we have an extra (saved)
point and are allowed to restart the algorithm.

* To find the minimal size of gap for which we restart the algorithm, we
estimate the 'expected' gap between sequential points. When we deal with
non-raw (CACHE) data this value is exact (the 'resolution', i.e. the window /
interval width, and the maximal allowed number of points '$limit' are used
for the calculation).

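A minimal sketch of the 'expected' gap and the restart decision under these
assumptions (illustrative names, not the actual ADEI code):

    <?php
    // For CACHE data the expected distance between emitted points is known
    // exactly from the requested window and the maximal number of points.
    function expected_gap(float $window, int $limit)
    {
        return $window / $limit;        // expected spacing, in seconds
    }

    // Restart the MINMAX algorithm only if the gap is big enough to hold
    // another point, i.e. we "saved" a point by not drawing inside the gap.
    function should_restart(float $gap, float $expected)
    {
        return $gap > $expected;
    }
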
Things are a little bit more complicated while dealing with raw data. We do
not actually know how many data points will be found; all we know is that
this value should be below the limit, so using the limit for the estimate
would cause very frequent restarts in cases where there are in reality much
fewer points than allowed.
To handle that case, we treat the intervals with a single data point inside
differently. If we are outside of our algorithm (just starting or after a
restart) we just put the points in the appropriate places without starting
the algorithm. Of course, if we are already processing the data using the
MINMAX algorithm, we would continue as long as the points stay within the
allowed range from each other. So, summarizing (see the sketch after this
list):
 a) if we have a high density of points, the distances between points will
    fit in the approximated range. They would be aggregated and there will be
    no intervals with a single point. Therefore, the standard processing
    using the MIN-MAX algorithm is used.
 b) if we have a low density of points, we will have single-point intervals
    and will process them directly.
 c) in the mixed cases, when we have a variable density over the interval,
    the MINMAX algorithm would exit on the first stand-alone point (a single
    point located away from the others) and enter MINMAX mode again when the
    dense area begins.

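A rough sketch of this per-interval decision for raw data (the structure and
field names, in particular 'items' and the 'gap' to the previous point, are
illustrative assumptions; minmax_points() refers to the sketch above):

    <?php
    // Illustrative only: decide, interval by interval, whether to emit a
    // lone point directly or to (re)enter the MINMAX aggregation mode.
    function process_raw_intervals(array $ivls, float $expected)
    {
        $points = array();
        $run    = array();  // intervals collected for the current MINMAX run

        foreach ($ivls as $ivl) {
            $alone = ($ivl['items'] == 1) && ($ivl['gap'] > $expected);
            if ($alone) {
                // Stand-alone point: flush the current MINMAX run, if any,
                // and place the point directly without the algorithm.
                if ($run) {
                    $points = array_merge($points, minmax_points($run));
                    $run = array();
                }
                $points[] = array($ivl['start'], $ivl['min']);
            } else {
                // Dense area: stay in (or enter) MINMAX mode.
                $run[] = $ivl;
            }
        }
        if ($run)
            $points = array_merge($points, minmax_points($run));
        return $points;
    }
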
* All this was not about reporting missing data to the user; such a gap could
be completely OK from the data collection point of view.

Reporting Gaps
--------------
If the "CACHE::REPORT_EMPTY" mode is switched on and there are intervals
without data present, these intervals will be reported (the rules are
sketched after this list). Normally, missing data is just silently ignored.

 - Missing points on the edges are not reported.
 - Gaps below the doubled 'expected' gap (see section Handling Gaps above)
   are not reported.
 - By default only the gaps above the interval size of the minimal used CACHE
   table are reported. However, this minimal size could be adjusted in the
   configuration on a per-group basis.

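The reporting rules could be summarised roughly as follows (an illustrative
predicate with assumed parameter names, not the actual ADEI code):

    <?php
    // Illustrative only: should an empty interval be reported to the user?
    function report_gap(float $gap, bool $at_edge, float $expected, float $min_cache_size)
    {
        if ($at_edge)                return false;  // edges are never reported
        if ($gap < 2 * $expected)    return false;  // below the doubled expected gap
        if ($gap <= $min_cache_size) return false;  // below the configured minimum
        return true;
    }
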
Expected Position Error
-----------------------
Maximal error for this approximation is 2 intervals. Consider the
following case: ---.-|-.--- (3 intervals, with a short drop within
one of them). This case could be handled as
 a) \\\.///.---
 b) ---.\\\.///
i.e. in either case some of the lines are actually two intervals away
from the real value.


***************************** DATA AGGREGATION ******************************
=============================================================================

DATA GAPS
=========
There are two parts in the data gap calculation code: creating the caching
tables and aggregating the data from the caching tables.

1. The 'missing' attribute in the non-raw caching tables is used to store the
maximal gap in the data. The current code has limited precision; the real gap
is in the following range: "missing <= real_gap < 3 x missing".
This is because of edge cases. For example, take two cache levels, 60 and 10,
and the following sequence of 10-second intervals: |. |,| |,| .|
i.e. a single point at the beginning, no points, a single point at the end.
The current algorithm will produce only 'missing=10' for the encompassing
cache60 interval, although the real gap is almost 30.
[ We don't check the edges; if there were no points in the first and third
intervals as well, the result would be correct - 30. ]

2. For speed reasons we perform the aggregation on the database side, not in
PHP. Unfortunately, there is no flexible way to process sequential rows and
find the distance between them (or to detect when the amount of points drops
to 0 and rises again). [ Actually there is a non-standard way of doing that,
but it doesn't work from PHP, only from the mysql client, and it relies on
undocumented behaviour which could change in future versions of MySQL. The
way is based on user variables and the sequence of their evaluation. The
example looks like:

SELECT UNIX_TIMESTAMP(MIN(time)) AS timestamp,
       COUNT(*) AS items,
       (UNIX_TIMESTAMP(MAX(time)) - UNIX_TIMESTAMP(MIN(time))) AS width,
       MAX(IF(@tmpvar_pos=FLOOR((UNIX_TIMESTAMP(`time`) - 1185781200 )/171.428571429), (UNIX_TIMESTAMP(time)-@tmpvar_width), 0)) AS maxgap,
       @tmpvar_pos:=FLOOR((UNIX_TIMESTAMP(`time`) - 1185781200 )/171.428571429) AS tmpcol1,
       @tmpvar_width:=UNIX_TIMESTAMP(time) AS tmpcol2,
       MIN(v1) AS min0, MAX(v1) AS max0, AVG(v1) AS mean0
FROM `cache0__katrin__hauptspektrometer__0`
WHERE ((`time` >= 20070730074000) AND (`time` < 20070730080000))
GROUP BY FLOOR((UNIX_TIMESTAMP(`time`) - 1185781200 )/171.428571429)
ORDER BY `time` ASC;
]

At the moment we just assume that the data is uniformly distributed over the
aggregated intervals. This could lead to an even bigger inaccuracy in the
gaps. However, the approximated gap values still do not exceed the real value
(under the uniform assumption the estimate is essentially the average spacing,
which never exceeds the maximal one).
In future versions we should implement a precise way of gap calculation
using stored functions for processing sequential rows.


******************************* OPTIONS *************************************
=============================================================================

Global options
==============

Various per-group options
=========================
* fill_raw_first
  Fill the RAW cache table completely prior to processing the aggregating
  cache tables.

  This is mainly intended for sources without correct indexes. In such
  situations selecting the data for a specified time interval may take a
  while (the database engine iterates over the whole dataset finding the
  matching rows instead of looking them up in an index). Therefore, we issue
  only a single request to the data source, so that the expensive scan
  happens only once.

  The tests, however, have shown that this method is slightly faster than the
  default one (31s against 39s) and, therefore, could be used to optimize the
  speed even if the indexes are okay. [ Actually this really depends on the
  relative performance of the data source server and the caching server and,
  therefore, it is better to recheck whether it is really an optimization and
  not a slowdown. ]
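
  A purely hypothetical illustration of enabling this option for one group
  (the actual ADEI configuration layout and key names may differ; only the
  option name 'fill_raw_first' is taken from the text above):

      <?php
      // Hypothetical per-group configuration entry enabling the option.
      $group_options['example_server']['example_database']['example_group'] = array(
          'fill_raw_first' => true,  // fill the RAW cache before aggregating caches
      );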