Quantcast
Channel: MATLAB Central Newsreader Recent Posts
Viewing all articles
Browse latest Browse all 19628

Re: shuffle without enough memory

$
0
0
"Skirt Zhang" wrote in message <k7jh0a$bov$1@newscl01ah.mathworks.com>...
> My problem is I have 1000 data in columns with real value Pi. The 1-1000 are column indices. I want to calculate the mean&median values for the triple products across all the columns, and also the bi-products of them, i.e. PiPjPk and PiPj, i j k are different from each other.
>
> To make things faster and without violating the memory in Matlab, I calculate the possible 1e6 combinations of the column indices using the shuffle function. Then I calculate the products according to the shuffled index combinations.
>
> I am wondering if there is an efficient and more accurate way to calculate the mean and median ? According to a previous suggestion from Roger, I may solve it as below cited text. but I have problems to implement it :
- - - - - - - -
  Skirt Zhang, after reviewing your recent posting and your earlier one Dec. 26, 2011, I now understand that you have one thousand "return" values and you wish to find the precise median and mean for all possible triple products of distinct values among these. In the codes below, the n=1000 return values are assumed to already be in a vector I call 'R'. (For the much simpler mean value see the last paragraph below.)

  There are two stages in the computation of the median. First, an approximate distribution is found based on a Monte Carlo process in the array P1. The larger P1 is, the more accurate this approximation. Second, the approximate median as determined by P1 is used to choose an initial subrange of values from all possible triple products R(i)*R(j)*R(k), i<j<k, selected so as to fit within array P2. Three counts are kept: 1) C1, the number of products below the subrange and therefore not included in P2, 2) C2, the number of products inserted into P2, and 3) C3, the number above the subrange and also not included. From the final C1, C2, and C3 values, which must sum to M=n*(n-1)*(n-2)/6, it can be determined whether or not the median value lies within P2. If it does, then the range of P2(1:C2) is sorted and that median evaluated. If the median is not within P2, the subrange is
appropriately moved up or down and the whole process repeated. It should be borne in mind that each time a new subrange has to be selected in this manner it requires that all n*(n-1)*(n-2)/6 = 166,167,000 ordered triple products be generated, so an appreciable amount of computation time will be required for each such step.

  The parameter 'f' must be set less than 1. It sets the fraction of the P2 array that would theoretically be used if the distribution in P1 were perfect, so by setting it less than 1 this helps guard against potential overflow of P2. You may want to reduce 'f' if you encounter trouble with P2 overflow or to increase it if it proves to be too cautious. The parameter 'f2' should be slightly less than 2 and provides a small overlap between successive subranges whenever more than one is necessary.

  The following code is somewhat crude and could probably be improved upon with more intelligent adjustment of successive subranges or with better handling of P2 overflow, but this is all I have time for now. I have set very large sizes, N1 and N2, for the two arrays, P1 and P2, in view of your statement that you had 10e6 memory available. You may have to adjust these in accordance with your particular needs, but remember, larger is better!

n = length(R); % = 1000
N1 = 1e6; P1 = zeros(N1,1); % Allocate huge arrays
N2 = 2e6; P2 = zeros(N2,1);
M = n*(n-1)*(n-2)/6; % The total number of possible ordered triple products
f = 3/4; % Overflow protection for P2
f2 = 1.9; % Small overlap provided between successive subranges

% Get approximate distribution
for i1 = 1:N1
  t = ceil(n*rand(3,1));
  while t(1)==t(2)||t(2)==t(3)||t(3)==t(1) % <-- The rejection process
    t = ceil(n*rand(3,1)); % Try again
  end
  P1(i1) = prod(R(t));
end
P1 = sort(P1);
% Approx. median would be (P1(floor((N1+1)/2))+P1(ceil((N1+1)/2)))/2

% Now get exact median
m = (N1+1)/2;
w = f*N2*N1/M/2; % Half the width of the subrange
b = true;
while b
  D = P1(max(ceil(m-w),1)); % Get lower and upper bounds on subrange
  U = P1(min(floor(m+w),N1)); % from the P1 array
  C1 = 0; C2 = 0; C3 = 0; % Start all counts at zero
  for i1 = 1:n-2 % Generate all M possible ordered triple products
    for i2 = i1+1:n-1
      p12 = R(i1)*R(i2);
      for i3 = i2+1:n
        p = p12*R(i3);
        if p <= U
          if p >= D
            C2 = C2 + 1; % Product lies within subrange
            if C2 > N2
              error('Exceeded P2 array limit')
            end
            P2(C2) = p; % Append it to P2
          else
            C1 = C1 + 1; % Product is below subrange
          end
        else
          C3 = C3 + 1; % Product is above subrange
        end
      end
    end
  end
  if C1 > floor((M-1)/2)
    m = m - f2*w;
    fprintf('Adjust P2 subrange downwards\n')
  elseif C3 > floor((M-1)/2)
    m = m + f2*w;
    fprintf('Adjust P2 subrange upwards\n')
  else
    b = false; % Median was within P2, exit while loop
  end
end
P2(1:C2) = sort(P2(1:C2)); % Sort in preparation for median computation
med = (P2(floor((M+1)/2)-C1)+P2(ceil((M+1)/2)-C1))/2; % The precise median
fprintf('Median = %23.16e\n',med)

  The computation for the mean is far, far simpler than the above and requires no large storage. The mean value of triple products on R is equal to:

 mv = (sum(R)^3-3*sum(R.^2)*sum(R)+2*sum(R.^3))/(n*(n-1)*(n-2));

  (Note: If you want to check the validity of these algorithms, use short test R vectors with appropriately reduced P1 and P2 sizes and compare those results with a direct matlab median and mean computation of a fully stored set of triple products thereof.)

Roger Stafford

Viewing all articles
Browse latest Browse all 19628

Trending Articles