/adei/trunk

To get this branch, use:
bzr branch http://darksoft.org/webbzr/adei/trunk
1 by Suren A. Chilingaryan
Initial import
****************************** BASIC APPROACH *******************************
=============================================================================
* To make fast graphs we reduce the actual number of points being processed.
Basically, if we need to display a 600x600 picture, we need no more than 600
points. To get these points from the (possibly) huge amount of real data we
use a two-step algorithm.

First Step
----------
* In the first step we divide the actual time window (the whole interval we
are interested in) into N intervals and aggregate the data over them. For
each interval we know (see the sketch after this list):
  - Interval start
  - Interval width
  - Number of data points within the interval
  - Aggregated values for all channels (MIN, MAX, MEAN)
  - In future versions other aggregate types (like MEDIAN) could be added
  - Optionally, we could have information about the maximal data gap within
    the interval (see the "DATA GAPS" section for information about precision)
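A minimal sketch of such an interval record and of the aggregation itself
(illustrative Python, not the actual PHP implementation; all names are our
own):

    from dataclasses import dataclass

    @dataclass
    class Interval:
        start: float   # interval start (timestamp)
        width: float   # interval width (seconds)
        items: int     # number of data points within the interval
        vmin: float    # aggregated MIN
        vmax: float    # aggregated MAX
        mean: float    # aggregated MEAN

    def aggregate(samples, t0, t1, n):
        """Divide the window [t0, t1) into n intervals and aggregate the
        samples (a list of (time, value) pairs) over them."""
        width = (t1 - t0) / n
        ivls = [Interval(t0 + i * width, width, 0,
                         float('inf'), float('-inf'), 0.0) for i in range(n)]
        for t, v in samples:
            if not (t0 <= t < t1):
                continue
            ivl = ivls[min(int((t - t0) / width), n - 1)]
            ivl.items += 1
            ivl.vmin = min(ivl.vmin, v)
            ivl.vmax = max(ivl.vmax, v)
            ivl.mean += (v - ivl.mean) / ivl.items   # running mean
        return ivls   # intervals with items == 0 are the data gaps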
* Of course, the database requests obtaining this information over big
intervals would take a lot of time. Therefore, we have CACHE tables
containing aggregated information for a few standard interval sizes
(configurable depending on the expected data rate).
  When a data request comes in, we are (see the selection sketch after this
  list):
    1) Checking if the required interval size is quite close to one of the
    CACHE tables and returning information from the appropriate CACHE table
    if so.
    2) Otherwise, aggregating the data from the biggest possible CACHE table
    which still provides enough data points (when aggregated).
    3) For small intervals, executing the aggregation over the raw data
    table. However, this does not take much time for small interval sizes.
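A minimal sketch of this selection logic (illustrative Python; the cache
sizes and the 'close enough' threshold are our assumptions):

    def pick_cache(window, npoints, cache_sizes, close_factor=1.5):
        """Choose the aggregation source for a request of npoints over
        'window' seconds; cache_sizes lists the CACHE interval sizes in
        seconds, e.g. [10, 60, 600]."""
        needed = window / npoints              # ideal interval size
        # 1) a CACHE table whose interval size is quite close to the need
        for size in sorted(cache_sizes, reverse=True):
            if size <= needed <= size * close_factor:
                return ('cache', size)
        # 2) otherwise the biggest CACHE still giving enough data points
        candidates = [s for s in cache_sizes if s <= needed]
        if candidates:
            return ('cache', max(candidates))
        # 3) small intervals: aggregate over the raw data table
        return ('raw', None)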
* The only problem is the case when we have a few short periods with a very
high update rate, while the rest of the data is sampled at much lower
frequencies. In that case it makes no sense (and would take a lot of disk
space) to create a CACHE table for aggregation. But the amount of data in
those specific places is too much for in-place aggregation and, therefore,
these places would be displayed with a big delay.
 To handle such situations, we are using so-called subcaches. [ WILL BE
 INTRODUCED in the next version ]
Second Step
-----------
* In the second step we convert the intervals into data points. Actually, we
use several approaches to get different views of the data: a smooth averaged
graph to get an impression of the global processes, the MIN-MAX approach
representing all extreme values of the data channels, etc.


MINMAX APPROACH
===============
 N + 1 points are used to represent N intervals.
  * Points are placed at the beginning, at the end, and between all intervals.
  * For placing Y, the following MIN-MAX-MIN algorithm is used. Consider two
  intervals: (min1,max1), (min2,max2).
    The following three points (a,b,c) will be used:
	1. starting from the minimum: a = min1
	2. checking if the second interval lies completely above us
	    a) if so, putting b = min2
		- location: beginning of the second interval
		- in the next iteration we will continue to rise (at least
		  to max2)
	    b) otherwise, b is MAX(max1, max2)
		- location: end of ivl1 or beginning of ivl2
		- in the next iteration we will descend
 	3. c = max2 or min2, depending on the previous step
  * All points are processed in one chunk and, therefore, we would have only
  one extra point compared to the number of intervals. A sketch of this
  zigzag placement follows.
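A sketch of the zigzag placement (illustrative Python; the descending branch
mirrors the rising one, which is our extrapolation of the description):

    def minmax_points(ivls):
        """Turn N (min, max) intervals into N + 1 alternating points."""
        if not ivls:
            return []
        points = [ivls[0][0]]           # 1. start from the minimum: a = min1
        rising = True
        for i in range(len(ivls) - 1):
            (min1, max1), (min2, max2) = ivls[i], ivls[i + 1]
            if rising and min2 > max1:  # 2. next interval completely above us
                points.append(min2)     #    a) b = min2, keep rising to max2
            elif rising:
                points.append(max(max1, max2))  # b) b = MAX(max1, max2)
                rising = False                  #    next iteration descends
            elif not rising and max2 < min1:    # mirror case while descending
                points.append(max2)
            else:
                points.append(min(min1, min2))
                rising = True
        # 3. c = max or min of the last interval, depending on the direction
        points.append(ivls[-1][1] if rising else ivls[-1][0])
        return points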
  
  Handling Gaps
  -------------
  * The described procedure should be interrupted if we encounter a gap in
  the data; otherwise we would get severe precision problems. However, each
  such break would produce an extra point, which would exceed the specified
  limit of points. To handle that problem, we ignore small gaps and restart
  the algorithm only if there is enough space in the gap for another point,
  i.e. we normally do not put points in the gap area. Therefore, if a point
  would have been placed there (were it not a gap), we consider that we have
  an extra (saved) point and are allowed to restart the algorithm.
 
  * To find the minimal size of gap for which we restart the algorithm, we
  estimate the 'expected' gap between consecutive points. When dealing with
  non-raw (CACHE) data this value is exact (the 'resolution', i.e. the window
  width over the interval width, and the maximal allowed number of points
  '$limit' are used for the calculation).
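  Our reading of this calculation, as a sketch (names are ours):

    def expected_gap(window_width, interval_width, limit):
        """Expected distance between consecutive output points; exact for
        CACHE data, where every interval has a known width."""
        resolution = window_width / interval_width  # number of CACHE intervals
        return window_width / min(resolution, limit)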
  Things are a little bit more complicated while dealing with raw data. We do
  not actually know how many data points will be found; we only know that this
  value should be below the limit. Using the limit directly would cause very
  frequent restarts in cases where there are in reality far fewer points than
  allowed.
  To handle that case, we treat the intervals with a single data point inside
  differently. If we are outside of the algorithm (just starting or after a
  restart) we simply put the points in the appropriate place without starting
  the algorithm. Of course, if we are already processing the data using the
  MINMAX algorithm, we continue as long as the points are within the allowed
  range from each other. So, summarizing (a sketch of this dispatch follows
  the list):
    a) if we have a high density of points, the distances between points fit
    within the approximated range. The points are aggregated and there are no
    single-point intervals. Therefore, standard processing using the MIN-MAX
    algorithm is used.
    b) if we have a low density of points, we will have single-point intervals
    and will process them directly.
    c) in mixed cases, when the density varies over the interval, the MINMAX
    algorithm exits on the first stand-alone point (a single point located
    away from the others) and enters MINMAX mode again when the dense area
    begins.
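  A rough sketch of this dispatch (illustrative Python; the real code embeds
  it in the MINMAX processing loop, and the names are ours):

    def dispatch(intervals, expected_gap):
        """intervals: (start, items) pairs; returns ('direct', t) markers
        for stand-alone points and ('minmax', t) for dense areas."""
        out, in_minmax, last_t = [], False, None
        for start, items in intervals:
            isolated = last_t is None or (start - last_t) > expected_gap
            if items == 1 and (not in_minmax or isolated):
                out.append(('direct', start))  # place the point as-is
                in_minmax = False              # MINMAX exits here
            else:
                out.append(('minmax', start))  # the zigzag sketch above runs
                in_minmax = True
            last_t = start
        return out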
  * All of the above is not about reporting missing data to the user; such a
  gap could be completely OK from the data collection point of view.
  
  Reporting Gaps
  --------------
  If the "CACHE::REPORT_EMPTY" mode is switched on and there are intervals
  without data present, these intervals will be reported. Normally, missing
  data is just silently ignored.

  - Missing points on the edges are not reported.
  - Gaps below the doubled 'expected' gap (see the Handling Gaps section
  above) are not reported.
  - By default, only the gaps above the interval size of the smallest used
  CACHE table are reported. However, the minimal size can be adjusted in the
  configuration on a per-group basis (see the sketch after this list).
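  These rules can be summarized in a small predicate (a sketch; the parameter
  names are ours):

    def report_gap(gap, expected_gap, min_cache_interval, at_edge=False):
        """Decide whether a data gap should be reported to the user."""
        if at_edge:                      # missing points on the edges
            return False
        if gap < 2 * expected_gap:       # below the doubled 'expected' gap
            return False
        return gap > min_cache_interval  # default, configurable per group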
  Expected Position Error
  -----------------------
    The maximal error of this approximation is 2 intervals. Consider the
    following case: ---.-|-.--- (3 intervals, with a short drop within
    one of them). This case could be rendered as
        a) \\\.///.---
	b) ---.\\\.///
    I.e. in either case some lines are actually two intervals away
    from the real value.


***************************** DATA AGGREGATION ******************************
=============================================================================
DATA GAPS
=========
 There are two parts in the data gap calculation code: creating the caching
 tables and aggregating the data from the caching tables.

 1. The 'missing' attribute in non-raw caching tables is used to store the
 maximal gap in the data. The current code has limited precision; the real
 gap lies in the following range: "missing <= real_gap < 3 x missing".
 This is because of edge cases. For example, assume we have two cache levels,
 60 and 10, and the following sequence of 10s intervals: |.   |,|     |,|    .|
 i.e. a single point at the beginning, no points at all, and a single point
 at the end. The current algorithm will produce only 'missing=10' for the
 encompassing cache60 interval, although the real gap approaches 30.
 [ We don't check the edges; if there were no points in the first and third
 intervals, the result would be correct: 30. ]
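 One plausible reconstruction of the bookkeeping behind this example
 (illustrative Python; the real code aggregates in SQL):

    # three of the cache10 rows inside one cache60 interval: |.   | |     | |    .|
    child_missing = [0,   # single point at the very start; edges not checked
                     10,  # completely empty interval reports its whole width
                     0]   # single point at the very end
    cache60_missing = max(child_missing)
    print(cache60_missing)   # -> 10, although the real gap approaches 30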
 2. For speed reasons we perform the aggregation on the database side, not in
 PHP. Unfortunately, there is no flexible way to process sequential rows and
 find the distance between them (or the places where the number of points
 drops to 0 and rises again). [ Actually, there is a non-standard way of doing
 that, but it doesn't work from PHP, only from the mysql client, and it relies
 on undocumented behaviour which could change in future versions of MySQL. The
 approach is based on user variables and the order of their evaluation. The
 example looks like:
    SELECT UNIX_TIMESTAMP(MIN(time)) AS timestamp,
           COUNT(*) AS items,
           (UNIX_TIMESTAMP(MAX(time)) - UNIX_TIMESTAMP(MIN(time))) AS width,
           MAX(IF(@tmpvar_pos=FLOOR((UNIX_TIMESTAMP(`time`) - 1185781200 )/171.428571429),
                  (UNIX_TIMESTAMP(time)-@tmpvar_width), 0)) AS maxgap,
           @tmpvar_pos:=FLOOR((UNIX_TIMESTAMP(`time`) - 1185781200 )/171.428571429) AS tmpcol1,
           @tmpvar_width:=UNIX_TIMESTAMP(time) AS tmpcol2,
           MIN(v1) AS min0, MAX(v1) AS max0, AVG(v1) AS mean0
    FROM `cache0__katrin__hauptspektrometer__0`
    WHERE ((`time` >= 20070730074000) AND (`time` < 20070730080000))
    GROUP BY FLOOR((UNIX_TIMESTAMP(`time`) - 1185781200 )/171.428571429)
    ORDER BY `time` ASC;
 At the moment we just assume that the data is uniformly distributed over the
 aggregated intervals. This can make the gap estimates even less accurate.
 However, the approximated gap values still do not exceed the real value (a
 toy check follows).
 In future versions we should implement a precise way of gap calculation
 using stored functions for processing sequential rows.
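 Why the uniform estimate cannot exceed the real value: n points split an
 interval of width W into at most n + 1 stretches, so the largest real gap is
 at least W / (n + 1), which is what uniform spacing predicts. A toy check
 (our formulation, not the exact formula used in the code):

    def uniform_gap_estimate(width, npoints):
        """Estimated maximal gap for npoints spread uniformly over width."""
        return width / (npoints + 1)

    times = [3.0, 3.1, 3.2]          # 3 points clustered inside [0, 10)
    real_gaps = ([times[0] - 0.0]
                 + [b - a for a, b in zip(times, times[1:])]
                 + [10.0 - times[-1]])
    assert uniform_gap_estimate(10.0, 3) <= max(real_gaps)   # 2.5 <= 6.8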
******************************* OPTIONS *************************************
=============================================================================

Global options
==============

Various per-group options
=========================
 * fill_raw_first
    Fill the RAW cache table completely prior to processing the aggregating
    cache tables.

    This is mainly intended for sources without correct indexes. In such
    situations, selecting the data for a specified time interval may take a
    while (the database engine iterates over the whole dataset to find
    matching rows instead of looking in an index). Therefore, we make only a
    single request to the data source, so that the full scan happens only
    once.

    The tests, however, have shown that this method is slightly faster than
    the default one (31s against 39s) and, therefore, could be used to
    optimize speed even if the indexes are okay. [ Actually, this really
    depends on the relative performance of the data source server and the
    caching server; therefore, it's better to recheck whether it is really an
    optimization and not a slowdown. ]
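    Why a single full fetch pays off on an unindexed source, as a toy
    comparison (illustrative Python; not ADEI code):

        import bisect

        def select_indexed(sorted_times, t0, t1):
            """With an index, an interval selection costs O(log n)."""
            lo = bisect.bisect_left(sorted_times, t0)
            hi = bisect.bisect_left(sorted_times, t1)
            return sorted_times[lo:hi]

        def select_unindexed(times, t0, t1):
            """Without an index, every selection scans the whole dataset;
            repeating this per interval is what fill_raw_first avoids by
            fetching everything once into the local RAW cache."""
            return [t for t in times if t0 <= t < t1]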