****************************** BASIC APPROACH *******************************
=============================================================================

* To make graphs fast, we reduce the actual number of points processed.
Basically, to display a 600x600 picture we need no more than 600 points.
To get these points from the (possibly) huge amount of real data we use a
two-step algorithm.
* At the first step we divide the actual time window (the whole interval we
are interested in) into N intervals and aggregate the data over them. For
each interval we store:
- Number of data points within the interval
- Aggregated values for all channels (MIN, MAX, MEAN)
- In future versions other aggregation types (like MEDIAN) could be added
- Optionally, we could have information about the maximal data gap within
the interval (please consider the "DATA GAPS" section for information about
precision)
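
The per-interval aggregation above can be sketched as follows (an
illustrative Python sketch, not the actual PHP implementation; the function
name and data layout are assumptions):

```python
def aggregate(samples, start, end, n):
    """Aggregate (timestamp, value) samples into n equal intervals over
    [start, end), keeping count, MIN, MAX, and MEAN per interval.
    Empty intervals stay None, i.e. a data gap."""
    width = (end - start) / n
    stats = [None] * n                      # per-interval [count, min, max, sum]
    for t, v in samples:
        if not (start <= t < end):
            continue
        i = int((t - start) / width)
        if stats[i] is None:
            stats[i] = [1, v, v, v]
        else:
            s = stats[i]
            s[0] += 1
            s[1] = min(s[1], v)
            s[2] = max(s[2], v)
            s[3] += v
    # convert the running sum into MEAN on output
    return [None if s is None else (s[0], s[1], s[2], s[3] / s[0])
            for s in stats]
```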
* Of course, database requests obtaining this information over big intervals
would take a lot of time. Therefore, we have CACHE tables containing
aggregated information for a few standard interval sizes (configurable
depending on the data source).

When a data request comes in, we are:
1) Checking if the requested interval is close enough to one of the CACHE
tables and returning information from the appropriate CACHE table if so.
2) Otherwise, aggregating the data from the biggest possible CACHE table
which would still provide enough data points (when aggregated).
3) For small intervals, executing the aggregation over the raw data table.
However, this does not take much time for small interval sizes.
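
The selection steps above could look roughly like this (a sketch under
assumed cache sizes and a made-up "close enough" factor, not the real
configuration or code):

```python
CACHE_LEVELS = [10, 60, 600, 3600]  # aggregation interval per CACHE table, s

def pick_cache_level(window, limit, close_factor=1.5):
    """Return the CACHE interval size to read from, or None for raw data.

    window -- requested time window in seconds
    limit  -- maximal number of points to return
    """
    target = window / limit  # ideal interval size for this request
    # 1) a cache whose resolution is close to the target is used directly
    for size in CACHE_LEVELS:
        if size <= target <= size * close_factor:
            return size
    # 2) otherwise take the biggest cache still finer than the target,
    #    so that re-aggregation yields enough points
    candidates = [s for s in CACHE_LEVELS if s <= target]
    if candidates:
        return max(candidates)
    # 3) for small windows, aggregate over the raw data table
    return None
```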
* The only problem is the case when we have a few small periods with a very
high update rate, while the rest of the data is sampled at much lower
frequencies. In that case it makes no sense (and would take a lot of disk
space) to make a CACHE table for aggregation. But the amount of data in the
specific places is too much for in-place aggregation and, therefore, these
places would be displayed with big approximation errors.

To handle such situations, we use so-called subcaches. [ WILL BE
INTRODUCED in the next version ]
* At the second step we convert the intervals into data points. Actually,
we use several approaches to get different views of the data: a smooth
averaged graph to get an impression of the global processes, the MIN-MAX
approach representing all extreme values of the data channels, etc.

N + 1 points are used for representing N intervals.
* Points are set at the beginning, at the end, and between all intervals.
* For placing Y the following MIN-MAX-MIN algorithm is used. Consider two
intervals: (min1,max1), (min2,max2).
The following three points (a,b,c) will be used.
1. starting from the minimum: a = min1
2. looking if the second interval is completely above us
   a) if so, putting b = min2
      - location: beginning of the second interval
      - on the next iteration we will continue to rise (at least to max2)
   b) otherwise, b is MAX(max1, max2)
      - location: end of ivl1 or beginning of ivl2
      - the next iteration will descend.
3. c = max2 or min2 depending on the previous step
* All points are processed in one chunk and, therefore, we get
only one extra point compared to the number of intervals.
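
The MIN-MAX-MIN placement can be sketched like this (an illustrative
re-implementation, not the production code; the mirrored "completely below"
branch for the descending case is my symmetric reading of step 2):

```python
def minmax_points(intervals):
    """Convert N (min, max) interval aggregates into N + 1 Y values.
    Point i sits on the boundary between interval i-1 and interval i."""
    if not intervals:
        return []
    pts = [intervals[0][0]]              # 1. start from the minimum: a = min1
    at_min = True                        # current point sits on a minimum
    for (min1, max1), (min2, max2) in zip(intervals, intervals[1:]):
        if at_min and min2 > max1:       # 2a. next interval completely above
            pts.append(min2)             #     keep rising: b = min2
        elif at_min:
            pts.append(max(max1, max2))  # 2b. b = MAX(max1, max2)
            at_min = False               #     the next step will descend
        elif max2 < min1:                # mirrored 2a: completely below us
            pts.append(max2)             #     keep falling: b = max2
        else:
            pts.append(min(min1, min2))  # mirrored 2b
            at_min = True
    # 3. c = max or min of the final interval, depending on the direction
    last_min, last_max = intervals[-1]
    pts.append(last_max if at_min else last_min)
    return pts
```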
* The described procedure should be interrupted if we get a gap in the data.
Otherwise we would get serious precision problems. However, for each break
we would have an extra point, which would break the specified limit of
points. To handle that problem, we ignore small gaps and restart the
algorithm only if there is enough space in the gap for another point, i.e.
we normally do not put points in the gap area. Therefore, if we would expect
a point to be put there (were it not a gap), we consider that we have an
extra (saved) point and are allowed to restart the algorithm.

* To find the minimal size of gap for which we restart the algorithm, we
estimate the 'expected' gap between sequent points. When we deal with
non-raw (CACHE) data this value is exact (the 'resolution', the window /
interval width, and the maximal allowed number of points '$limit' are used
for the estimation).
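
A minimal sketch of the 'expected gap' estimate and the restart decision
for CACHE data (the function names and the doubled-gap restart threshold
are assumptions mirroring the reporting rule mentioned later):

```python
def expected_gap(window, limit, resolution):
    """Expected distance between sequent points when reading CACHE data.

    window     -- width of the requested time window (seconds)
    limit      -- maximal allowed number of points ($limit)
    resolution -- interval size of the CACHE table used
    """
    # aggregated points cannot come closer than the cache resolution,
    # and at most 'limit' of them fit into the window
    return max(resolution, window / limit)

def should_restart(gap, window, limit, resolution):
    """Restart the MIN-MAX algorithm only if the gap leaves room for a
    point that would normally have been placed inside it."""
    return gap >= 2 * expected_gap(window, limit, resolution)
```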
Things are a little bit more complicated when dealing with raw data. We do
not actually know how many data points will be found; we only know this
value should be below the limit. Using the limit directly would cause very
frequent restarts in cases where there are in reality far fewer points than
allowed.
To handle that case, we treat intervals containing a single data point
differently. If we are outside of our algorithm (just starting or after a
restart) we simply put such points in the appropriate place without starting
the algorithm. Of course, if we are already processing the data using the
MINMAX algorithm, we continue as long as the points stay within the allowed
range from each other. So, summarizing:
a) if we have a high density of points, the distance between points will fit
in the approximated range. The points would be aggregated and there would be
no intervals with a single point. Therefore, standard processing using the
MIN-MAX algorithm is used.
b) if we have a low density of points we will have single-point intervals
and will process them directly.
c) in mixed cases, when we have variable density over the interval, the
MINMAX algorithm would exit on the first stand-alone point (a single point
located away from the others) and enter MINMAX mode again when a dense area
starts.
* All of this was not about reporting missing data to the user; such a gap
could be completely OK from the data collecting point of view.

If the "CACHE::REPORT_EMPTY" mode is switched on and there are intervals
without data present, these intervals will be reported. Normally, missing
data is just skipped. The following restrictions apply:
- Missing points on the edges are not reported.
- Gaps below the doubled 'expected' gap (see the section on handling gaps
above) are not reported.
- By default only the gaps above the interval size of the minimal used
CACHE table are reported. However, the minimal size could be adjusted in
the configuration on a per-group basis.
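
The reporting restrictions could be expressed as a filter like the
following (a sketch of the rules only; the function, its signature, and
the exact comparisons are my assumptions, not the real implementation):

```python
def reportable_gaps(gaps, window, expected, min_cache_interval):
    """Filter empty stretches according to the REPORT_EMPTY restrictions.

    gaps               -- list of (start, end) empty stretches
    window             -- (start, end) of the requested time window
    expected           -- the 'expected' gap between sequent points
    min_cache_interval -- interval size of the minimal used CACHE table
    """
    w_start, w_end = window
    result = []
    for g_start, g_end in gaps:
        size = g_end - g_start
        if g_start <= w_start or g_end >= w_end:
            continue  # gaps on the window edges are not reported
        if size < 2 * expected:
            continue  # below the doubled 'expected' gap
        if size < min_cache_interval:
            continue  # below the default minimal-CACHE threshold
        result.append((g_start, g_end))
    return result
```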
Expected Position Error
-----------------------
The maximal error of this approximation is 2 intervals. Consider the
following case: ---.-|-.--- (3 intervals, with a short drop within one of
them). Depending on where the drop falls relative to the interval
boundaries, some lines may end up placed up to two intervals away from
their real position.
***************************** DATA AGGREGATION ******************************
=============================================================================

There are two parts in the data gap calculation code: creating the caching
tables, and aggregating the data from the caching tables.
1. The 'missing' attribute in non-raw caching tables is used to store the
maximal gap in the data. The current code has limited precision; the real
gap is in the following range: "missing <= real_gap < 3 x missing".
This is because of edge cases. For example, say we have two cache levels, 60
and 10, and the following sequence of 10s intervals: |. |,| |,| .|
(a single point at the beginning, no points, a single point at the end). The
current algorithm will produce only 'missing=10' for the encompassing
cache60.
[ We don't check edges; if there were no points in the first and third
intervals, the result would be correct - 30. ]
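
The edge case above can be illustrated with a small sketch (the helper and
all concrete values are made up for the illustration):

```python
FINE = 10  # fine cache level interval size, seconds

def aggregate_missing(sub_missing):
    # The current algorithm stores only the maximum per-sub-interval gap;
    # it does not follow gaps continuing across sub-interval edges.
    return max(sub_missing)

# Three fine intervals:  |. |,| |,| .|
# a point near the start of the first, an empty second one, and a point
# near the end of the third. An empty interval contributes its full width.
sub_missing = [FINE, FINE, FINE]
stored = aggregate_missing(sub_missing)   # 'missing' for the cache60 row

first_point = 0.5                         # just after the start of interval 1
last_point = 3 * FINE - 0.5               # just before the end of interval 3
real_gap = last_point - first_point       # ~29 s, almost 3 x the stored value
```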
2. For speed reasons we perform the aggregation on the database side, not in
PHP. Unfortunately, there is no flexible way to process sequential rows and
find the distance between them (or detect when the amount of points drops to
0 and rises again). [ Actually there is a non-standard way of doing that,
but it does not work from PHP, only from the mysql client, and it relies on
undocumented behaviour which could change in future versions of MySQL. The
approach is based on user variables and the sequence of their evaluation.
The example looks like:

  SELECT
      UNIX_TIMESTAMP(MIN(time)) AS timestamp,
      COUNT(*) AS items,
      (UNIX_TIMESTAMP(MAX(time)) - UNIX_TIMESTAMP(MIN(time))) AS width,
      MAX(IF(@tmpvar_pos = FLOOR((UNIX_TIMESTAMP(`time`) - 1185781200) / 171.428571429),
             (UNIX_TIMESTAMP(time) - @tmpvar_width), 0)) AS maxgap,
      @tmpvar_pos := FLOOR((UNIX_TIMESTAMP(`time`) - 1185781200) / 171.428571429) AS tmpcol1,
      @tmpvar_width := UNIX_TIMESTAMP(time) AS tmpcol2,
      MIN(v1) AS min0, MAX(v1) AS max0, AVG(v1) AS mean0
  FROM `cache0__katrin__hauptspektrometer__0`
  WHERE ((`time` >= 20070730074000) AND (`time` < 20070730080000))
  GROUP BY FLOOR((UNIX_TIMESTAMP(`time`) - 1185781200) / 171.428571429)
  ORDER BY `time` ASC;
]
At the moment we just assume that the data is uniformly distributed over the
aggregated intervals. This could lead to even bigger gap inaccuracy.
However, the approximated gap values still do not exceed the real value.
In future versions we should implement a precise way of gap calculation
using stored functions to process sequential rows.
******************************* OPTIONS *************************************
=============================================================================

Various per-group options
=========================
Fill RAW cache table completely prior to processing aggregating cache tables

This is mainly intended for sources without correct indexes. In such
situations selecting the data for a specified time interval may take a
while (the database engine iterates over the whole dataset to find matching
rows instead of looking in an index). Therefore, we issue only a single
request to the data source, so the full scan happens only once.

The tests, however, have shown that this method is slightly faster than the
default one (31s against 39s) and, therefore, could be used to optimize
speed even if the indexes are okay. [ Actually, this really depends on the
relative performance of the data source server and the caching server and,
therefore, it is better to recheck whether it is really an optimization and
not a slowdown. ]