
Radix2 Fast Fourier Transform implemented in C++


Note: If you don't know much about Fourier transform algorithms, a simple review of whether I am doing anything inefficient with C++ in general would be appreciated.




I've been working on implementing an efficient Radix2 Fast Fourier Transform in C++ and I seem to have hit a roadblock. I have optimized it in every way I can think of, and it is very fast, but it is still significantly slower than NumPy's FFT in Python. Note that my FFT is not done in place, but neither is the NumPy implementation, so I should be able to achieve at least the same efficiency as NumPy.



I've already taken advantage of symmetry: when the input is a real signal, an N-point FFT can be computed using a single N/2-point complex FFT. I also pre-compute all of the twiddle factors and arrange the calculation so that redundant twiddle factors are not re-calculated.



The code:



#define _USE_MATH_DEFINES // for M_PI on MSVC
#include <cassert>
#include <cmath>
#include <complex>
#include <vector>

// To demonstrate runtime
#include <chrono>
#include <iostream>

static std::vector<std::complex<double>> FFTRadix2(const std::vector<std::complex<double>>& x, const std::vector<std::complex<double>>& W);
static bool IsPowerOf2(size_t x);
static size_t ReverseBits(const size_t x, const size_t n);

std::vector<std::complex<double>> FFT(const std::vector<double>& x)
{
    size_t N = x.size();

    // Radix2 FFT requires length of the input signal to be a power of 2
    // TODO: Implement other algorithms for when N is not a power of 2
    assert(IsPowerOf2(N));

    // Taking advantage of symmetry the FFT of a real signal can be computed
    // using a single N/2-point complex FFT. Split the input signal into its
    // even and odd components and load the data into a single complex vector.
    std::vector<std::complex<double>> x_p(N / 2);
    for (size_t n = 0; n < N / 2; ++n)
    {
        // x_p[n] = x[2n] + jx[2n + 1]
        x_p[n] = std::complex<double>(x[2 * n], x[2 * n + 1]);
    }

    // Pre-calculate twiddle factors
    std::vector<std::complex<double>> W(N / 2);
    std::vector<std::complex<double>> W_p(N / 4);
    for (size_t k = 0; k < N / 2; ++k)
    {
        W[k] = std::polar(1.0, -2 * M_PI * k / N);

        // The N/2-point complex DFT uses only the even twiddle factors
        if (k % 2 == 0)
        {
            W_p[k / 2] = W[k];
        }
    }

    // Perform the N/2-point complex FFT
    std::vector<std::complex<double>> X_p = FFTRadix2(x_p, W_p);

    // Extract the N-point FFT of the real signal from the results
    std::vector<std::complex<double>> X(N);
    X[0] = X_p[0].real() + X_p[0].imag();
    for (size_t k = 1; k < N / 2; ++k)
    {
        // Extract the FFT of the even components
        auto A = std::complex<double>(
            (X_p[k].real() + X_p[N / 2 - k].real()) / 2,
            (X_p[k].imag() - X_p[N / 2 - k].imag()) / 2);

        // Extract the FFT of the odd components
        auto B = std::complex<double>(
            (X_p[N / 2 - k].imag() + X_p[k].imag()) / 2,
            (X_p[N / 2 - k].real() - X_p[k].real()) / 2);

        // Sum the results and take advantage of symmetry
        X[k] = A + W[k] * B;
        X[k + N / 2] = A - W[k] * B;
    }

    return X;
}

std::vector<std::complex<double>> FFT(const std::vector<std::complex<double>>& x)
{
    size_t N = x.size();

    // Radix2 FFT requires length of the input signal to be a power of 2
    // TODO: Implement other algorithms for when N is not a power of 2
    assert(IsPowerOf2(N));

    // Pre-calculate twiddle factors
    std::vector<std::complex<double>> W(N / 2);
    for (size_t k = 0; k < N / 2; ++k)
    {
        W[k] = std::polar(1.0, -2 * M_PI * k / N);
    }

    return FFTRadix2(x, W);
}

static std::vector<std::complex<double>> FFTRadix2(const std::vector<std::complex<double>>& x, const std::vector<std::complex<double>>& W)
{
    size_t N = x.size();

    // Radix2 FFT requires length of the input signal to be a power of 2
    assert(IsPowerOf2(N));

    // Calculate how many stages the FFT must compute
    size_t stages = static_cast<size_t>(std::log2(N));

    // Pre-load the output vector with the input data using bit-reversed indexes
    std::vector<std::complex<double>> X(N);
    for (size_t n = 0; n < N; ++n)
    {
        X[n] = x[ReverseBits(n, stages)];
    }

    // Calculate the FFT one stage at a time and sum the results
    for (size_t stage = 1; stage <= stages; ++stage)
    {
        size_t N_stage = static_cast<size_t>(std::pow(2, stage));
        size_t W_offset = static_cast<size_t>(std::pow(2, stages - stage));
        for (size_t k = 0; k < N; k += N_stage)
        {
            for (size_t n = 0; n < N_stage / 2; ++n)
            {
                auto tmp = X[k + n];
                X[k + n] = tmp + W[n * W_offset] * X[k + n + N_stage / 2];
                X[k + n + N_stage / 2] = tmp - W[n * W_offset] * X[k + n + N_stage / 2];
            }
        }
    }

    return X;
}

// Returns true if x is a power of 2
static bool IsPowerOf2(size_t x)
{
    return x && (!(x & (x - 1)));
}

// Given x composed of n bits, returns x with the bits reversed
static size_t ReverseBits(const size_t x, const size_t n)
{
    size_t xReversed = 0;
    for (size_t i = 0; i < n; ++i)
    {
        xReversed = (xReversed << 1) | ((x >> i) & 1U);
    }

    return xReversed;
}

int main()
{
    size_t N = 16777216;
    std::vector<double> x(N);

    int f_s = 8000;
    double t_s = 1.0 / f_s;

    for (size_t n = 0; n < N; ++n)
    {
        x[n] = std::sin(2 * M_PI * 1000 * n * t_s)
            + 0.5 * std::sin(2 * M_PI * 2000 * n * t_s + 3 * M_PI / 4);
    }

    auto start = std::chrono::high_resolution_clock::now();
    auto X = FFT(x);
    auto stop = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(stop - start);

    std::cout << duration.count() << std::endl;
}


Output (running a few times and averaging):




3671677




This was compiled in Visual Studio 2019 in Release mode with the /O2, /Oi and /Ot optimization compiler flags to try and squeeze as much speed as possible out of it.




A comparable snippet of Python code that uses the NumPy FFT is shown below:



import numpy as np
import datetime

N = 16777216
f_s = 8000.0
t_s = 1/f_s

t = np.arange(0, N*t_s, t_s)
y = np.sin(2*np.pi*1000*t) + 0.5*np.sin(2*np.pi*2000*t + 3*np.pi/4)

start = datetime.datetime.now()
Y = np.fft.fft(y)
stop = datetime.datetime.now()
duration = stop - start

print(duration.total_seconds()*1e6)


Output (running a few times and averaging):




2100411.0





As you can see, the Python implementation is still faster, taking about 43% less time (2.10 s versus 3.67 s), and I can't think of any further ways to improve my implementation.



From what I understand, the NumPy version is actually implemented in C underneath, so I'm not terribly disappointed in the performance of my own code, but it still leaves me wondering what I'm missing and what I could do better.























  • Can you give a hint about the used C++ standard (11, 14, 17)? Can you use additional libraries such as gsl or abseil?
    – miscco
    8 hours ago









  • @miscco, C++11 is currently used, but C++17 is available. Additional libraries are not out of the question; if they are available through the Conan package manager, that would be a huge plus. Also, it is important that it stays cross-platform compatible, so OS-specific libraries are a no-go.
    – tjwrona1992
    8 hours ago


















c++ performance algorithm signal-processing complex-numbers






edited 8 hours ago by 200_success (135k reputation; 21 gold, 173 silver, 443 bronze badges)

asked 10 hours ago by tjwrona1992 (194 reputation; 1 silver, 9 bronze badges)
2 Answers

Putting this through the built-in profiler reveals some hot spots. Perhaps surprisingly: ReverseBits. It's not the biggest thing in the list, but it is significant when it shouldn't be.



You could use one of the many alternative ways to implement ReverseBits, or compute the sequence of bit-reversed indexes directly (which does not require reversing every index), or perform the overall bit-reversal permutation (which does not require bit reversals at all).



For example, here is a way to compute the sequence of bit-reversed indexes without explicitly reversing any index:



for (size_t n = 0, rev = 0; n < N; ++n)
{
    X[n] = x[rev];
    size_t change = n ^ (n + 1);
#if _WIN64
    rev ^= change << (__lzcnt64(change) - (64 - stages));
#else
    rev ^= change << (__lzcnt(change) - (32 - stages));
#endif
}



On my PC, that reduces the time from around 2.8 million microseconds to 2.3 million microseconds.



This trick exploits the fact that the XOR between adjacent indexes is a mask of ones up to and including the least significant zero (the +1 carries through the least significant set bits and into that least significant zero), and a mask of that form can be reversed by just shifting it. The reversed mask is then the XOR between adjacent reversed indexes, so XOR-ing it into the current reversed index increments it.



__lzcnt64 and _WIN64 are MSVC-specific; you could use more preprocessor tricks to pick the right intrinsic and detect the bitness for the current compiler. The leading-zero count can be avoided by using std::bitset and its count method:



size_t change = n ^ (n + 1);
std::bitset<64> bits(~change);
rev ^= change << (bits.count() - (64 - stages));


count is recognized by GCC and Clang as an intrinsic for popcnt, but apparently not by MSVC, so it is not reliable for high-performance scenarios.



Secondly, there is a repeated expression: W[n * W_offset] * X[k + n + N_stage / 2]. The compiler is often relied on to remove such duplication, but here it didn't happen. Factoring the product out into a local variable reduced the time to under 2 million microseconds.



Computing the twiddle factors takes a bit more time than it needs to. They are powers of the first non-trivial twiddle factor and could be computed iteratively that way. This suffers from some build-up of inaccuracy, which can be mitigated by periodically resetting to the proper value computed by std::polar. For example,



auto twiddle_step = std::polar(1.0, -2.0 * M_PI / N);
auto twiddle_current = std::polar(1.0, 0.0);
for (size_t k = 0; k < N / 2; ++k)
{
    if ((k & 0xFFF) == 0)
        twiddle_current = std::polar(1.0, -2.0 * M_PI * k / N);
    W[k] = twiddle_current;
    twiddle_current *= twiddle_step;

    // The N/2-point complex DFT uses only the even twiddle factors
    if (k % 2 == 0)
    {
        W_p[k / 2] = W[k];
    }
}




On my PC that reduces the time from hovering around 1.95 million µs to around 1.85 million µs, not a huge difference but easily measurable.



More advanced: use SSE3 for the main calculation, for example (not well tested, but seems to work so far)



__m128d w_real = _mm_set1_pd(W[n * W_offset].real());
__m128d w_imag = _mm_set1_pd(W[n * W_offset].imag());
__m128d z = _mm_loadu_pd(reinterpret_cast<double*>(&X[k + n + N_stage / 2]));
__m128d z_rev = _mm_shuffle_pd(z, z, 1);
__m128d t = _mm_addsub_pd(_mm_mul_pd(w_real, z), _mm_mul_pd(w_imag, z_rev));

__m128d x = _mm_loadu_pd(reinterpret_cast<double*>(&X[k + n]));
__m128d t1 = _mm_add_pd(x, t);
__m128d t2 = _mm_sub_pd(x, t);
_mm_storeu_pd(reinterpret_cast<double*>(&X[k + n]), t1);
_mm_storeu_pd(reinterpret_cast<double*>(&X[k + n + N_stage / 2]), t2);


That takes it from 1.85 million µs down to around 1.6 million µs on my PC.






– harold (answered 8 hours ago; edited 53 mins ago)














  • Is there a way that doesn't involve preprocessor tricks to make this cross-platform compatible? I'd prefer to avoid using the preprocessor, although it's not completely out of the question.
    – tjwrona1992
    8 hours ago










  • @tjwrona1992 I added an alternative. You could try one of the other permutation algorithms to avoid this problem altogether.
    – harold
    8 hours ago










  • Thanks, this all looks like very useful advice! I'll try to take advantage of some of it and let you know how it goes. Out of curiosity, how did you know the compiler didn't factor out the duplicated code?
    – tjwrona1992
    8 hours ago










  • @tjwrona1992 By proof-reading the assembly code, painful as it was.
    – harold
    8 hours ago










  • Ouch... hahaha. You sir, are a boss.
    – tjwrona1992
    8 hours ago



















You should be able to work in place by exploiting the fact that std::complex has a well-defined layout. This means you can cast between an array of double and an array of (half as many) complex numbers:



std::vector<std::complex<double>> a(10);
double *b = reinterpret_cast<double *>(a.data());


EDIT:



To be clearer, I would write



span<std::complex<double>> x_p(reinterpret_cast<std::complex<double>*>(x.data()), x.size() / 2);


This works in both directions. To keep things safe and modern, you should use a span object. Unfortunately, std::span is only available in C++20, so you should either write your own (which is a nice exercise) or have a look at absl::Span or gsl::span.



The code to implement such a span is rather minimal. With it, you can remove two copies from your code.




















  • Are you saying replace vector with span? I haven't used span before. I'll look into it!
    – tjwrona1992
    8 hours ago










  • No, a span is non-owning. Rather than copying the data into x_p, you should create a span that reinterprets the data in x as an array of std::complex.
    – miscco
    7 hours ago










  • Ahh, I see. Maybe I will implement another one that does it in place like that, but I would also like to have the option to do it out of place as well.
    – tjwrona1992
    3 hours ago










  • Do any compilers already support C++20? I didn't think it was officially released yet.
    – tjwrona1992
    47 mins ago













2 Answers
2






active

oldest

votes








2 Answers
2






active

oldest

votes









active

oldest

votes






active

oldest

votes









3












$begingroup$

Putting this through the built-in profiler reveals some hot spots. Perhaps surprisingly: ReverseBits. It's not the biggest thing in the list, but it is significant while it shouldn't be.



You could use one of the many alternate ways to implement ReverseBits, or the sequence of bit-reversed indexes (which does not require reversing all the indexes), or the overall bit-reversal permutation (which does not require bit reversals).



For example here is a way to compute the sequence of bit-reversed indexes without explicitly reversing any index:



for (size_t n = 0, rev = 0; n < N; ++n)

X[n] = x[rev];
size_t change = n ^ (n + 1);
#if _WIN64
rev ^= change << (__lzcnt64(change) - (64 - stages));
#else
rev ^= change << (__lzcnt(change) - (32 - stages));
#endif



On my PC, that reduces the time from around 2.8 million microseconds to 2.3 million microseconds.



This trick works by using that the XOR between adjacent indexes is a mask of ones up to and including the least significant zero (the +1 borrows through the least significant set bits and into that least significant zero), which has a form that can be reversed by just shifting it. The reversed mask is then the XOR between adjacent reversed indexes, so applying it to the current reversed index with XOR increments it.



__lzcnt64 and _WIN64 are for MSVC, you could use more preprocessor tricks to find the right intrinsic and bitness-detection for the current compiler. Leading zero count can be avoided by using std::bitset and its count method:



size_t change = n ^ (n + 1);
std::bitset<64> bits(~change);
rev ^= change << (bits.count() - (64 - stages));


count is recognized by GCC and Clang as an intrinsic for popcnt, but it seems not by MSVC, so it is not reliable for high performance scenarios.



Secondly, there is a repeated expression: W[n * W_offset] * X[k + n + N_stage / 2]. The compiler is often relied on to remove such duplication, but here it didn't happen. Factoring that out reduced the time to under 2 million microseconds.



Computing the twiddle factors takes a bit more time than it needs to. They are powers of the first non-trivial twiddle factor, and could be computed iteratively that way. This suffers from some build-up of inaccuracy, which could be improved by periodically resetting to the proper value computed by std::polar. For example,



auto twiddle_step = std::polar(1.0, -2.0 * M_PI / N);
auto twiddle_current = std::polar(1.0, 0.0);
for (size_t k = 0; k < N / 2; ++k)

if ((k & 0xFFF) == 0)
twiddle_current = std::polar(1.0, -2.0 * M_PI * k / N);
W[k] = twiddle_current;
twiddle_current *= twiddle_step;

// The N/2-point complex DFT uses only the even twiddle factors
if (k % 2 == 0)

W_p[k / 2] = W[k];




On my PC that reduces the time from hovering around 1.95 million µs to around 1.85 million µs, not a huge difference but easily measurable.



More advanced: use SSE3 for the main calculation, for example (not well tested, but seems to work so far)



__m128d w_real = _mm_set1_pd(W[n * W_offset].real());
__m128d w_imag = _mm_set1_pd(W[n * W_offset].imag());
__m128d z = _mm_loadu_pd(reinterpret_cast<double*>(&X[k + n + N_stage / 2]));
__m128d z_rev = _mm_shuffle_pd(z, z, 1);
__m128d t = _mm_addsub_pd(_mm_mul_pd(w_real, z), _mm_mul_pd(w_imag, z_rev));

__m128d x = _mm_loadu_pd(reinterpret_cast<double*>(&X[k + n]));
__m128d t1 = _mm_add_pd(x, t);
__m128d t2 = _mm_sub_pd(x, t);
_mm_storeu_pd(reinterpret_cast<double*>(&X[k + n]), t1);
_mm_storeu_pd(reinterpret_cast<double*>(&X[k + n + N_stage / 2]), t2);


That takes it from 1.85 million µs down to around 1.6 million µs on my PC.






share|improve this answer











$endgroup$














  • $begingroup$
    Is there a way that doesn't involve preprocessor tricks to make this cross platform compatible? I'd prefer to avoid using the preprocessor although it's not completely out of the question.
    $endgroup$
    – tjwrona1992
    8 hours ago










  • $begingroup$
    @tjwrona1992 I added an alternative. You could try one of the other permutation algorithms to avoid this problem altogether.
    $endgroup$
    – harold
    8 hours ago










  • $begingroup$
    Thanks this all looks like very useful advice! I'll try to take advantage of some of it and let you know how it goes. Out of curiosity, how did you know the compiler didn't factor out the duplicated code?
    $endgroup$
    – tjwrona1992
    8 hours ago










  • $begingroup$
    @tjwrona1992 by proof-reading the assembly code, painful as it was
    $endgroup$
    – harold
    8 hours ago










  • $begingroup$
    Ouch... hahaha. You sir, are a boss.
    $endgroup$
    – tjwrona1992
    8 hours ago















3












$begingroup$

Putting this through the built-in profiler reveals some hot spots. Perhaps surprisingly: ReverseBits. It's not the biggest thing in the list, but it is significant while it shouldn't be.



You could use one of the many alternate ways to implement ReverseBits, or the sequence of bit-reversed indexes (which does not require reversing all the indexes), or the overall bit-reversal permutation (which does not require bit reversals).



For example here is a way to compute the sequence of bit-reversed indexes without explicitly reversing any index:



for (size_t n = 0, rev = 0; n < N; ++n)

X[n] = x[rev];
size_t change = n ^ (n + 1);
#if _WIN64
rev ^= change << (__lzcnt64(change) - (64 - stages));
#else
rev ^= change << (__lzcnt(change) - (32 - stages));
#endif



On my PC, that reduces the time from around 2.8 million microseconds to 2.3 million microseconds.



This trick works by using that the XOR between adjacent indexes is a mask of ones up to and including the least significant zero (the +1 borrows through the least significant set bits and into that least significant zero), which has a form that can be reversed by just shifting it. The reversed mask is then the XOR between adjacent reversed indexes, so applying it to the current reversed index with XOR increments it.



__lzcnt64 and _WIN64 are for MSVC, you could use more preprocessor tricks to find the right intrinsic and bitness-detection for the current compiler. Leading zero count can be avoided by using std::bitset and its count method:



size_t change = n ^ (n + 1);
std::bitset<64> bits(~change);
rev ^= change << (bits.count() - (64 - stages));


count is recognized by GCC and Clang as an intrinsic for popcnt, but it seems not by MSVC, so it is not reliable for high performance scenarios.



Secondly, there is a repeated expression: W[n * W_offset] * X[k + n + N_stage / 2]. The compiler is often relied on to remove such duplication, but here it didn't happen. Factoring that out reduced the time to under 2 million microseconds.



Computing the twiddle factors takes a bit more time than it needs to. They are powers of the first non-trivial twiddle factor, and could be computed iteratively that way. This suffers from some build-up of inaccuracy, which could be improved by periodically resetting to the proper value computed by std::polar. For example,



auto twiddle_step = std::polar(1.0, -2.0 * M_PI / N);
auto twiddle_current = std::polar(1.0, 0.0);
for (size_t k = 0; k < N / 2; ++k)

if ((k & 0xFFF) == 0)
twiddle_current = std::polar(1.0, -2.0 * M_PI * k / N);
W[k] = twiddle_current;
twiddle_current *= twiddle_step;

// The N/2-point complex DFT uses only the even twiddle factors
if (k % 2 == 0)

W_p[k / 2] = W[k];




On my PC that reduces the time from hovering around 1.95 million µs to around 1.85 million µs, not a huge difference but easily measurable.



More advanced: use SSE3 for the main calculation, for example (not well tested, but seems to work so far)



__m128d w_real = _mm_set1_pd(W[n * W_offset].real());
__m128d w_imag = _mm_set1_pd(W[n * W_offset].imag());
__m128d z = _mm_loadu_pd(reinterpret_cast<double*>(&X[k + n + N_stage / 2]));
__m128d z_rev = _mm_shuffle_pd(z, z, 1);
__m128d t = _mm_addsub_pd(_mm_mul_pd(w_real, z), _mm_mul_pd(w_imag, z_rev));

__m128d x = _mm_loadu_pd(reinterpret_cast<double*>(&X[k + n]));
__m128d t1 = _mm_add_pd(x, t);
__m128d t2 = _mm_sub_pd(x, t);
_mm_storeu_pd(reinterpret_cast<double*>(&X[k + n]), t1);
_mm_storeu_pd(reinterpret_cast<double*>(&X[k + n + N_stage / 2]), t2);


That takes it from 1.85 million µs down to around 1.6 million µs on my PC.






share|improve this answer











$endgroup$














  • $begingroup$
    Is there a way that doesn't involve preprocessor tricks to make this cross platform compatible? I'd prefer to avoid using the preprocessor although it's not completely out of the question.
    $endgroup$
    – tjwrona1992
    8 hours ago










  • $begingroup$
    @tjwrona1992 I added an alternative. You could try one of the other permutation algorithms to avoid this problem altogether.
    $endgroup$
    – harold
    8 hours ago










  • $begingroup$
    Thanks this all looks like very useful advice! I'll try to take advantage of some of it and let you know how it goes. Out of curiosity, how did you know the compiler didn't factor out the duplicated code?
    $endgroup$
    – tjwrona1992
    8 hours ago










  • $begingroup$
    @tjwrona1992 by proof-reading the assembly code, painful as it was
    $endgroup$
    – harold
    8 hours ago










  • $begingroup$
    Ouch... hahaha. You sir, are a boss.
    $endgroup$
    – tjwrona1992
    8 hours ago













3












3








3





$begingroup$

Putting this through the built-in profiler reveals some hot spots. Perhaps surprisingly: ReverseBits. It's not the biggest thing in the list, but it is significant while it shouldn't be.



You could use one of the many alternate ways to implement ReverseBits, or the sequence of bit-reversed indexes (which does not require reversing all the indexes), or the overall bit-reversal permutation (which does not require bit reversals).



For example here is a way to compute the sequence of bit-reversed indexes without explicitly reversing any index:



for (size_t n = 0, rev = 0; n < N; ++n)

X[n] = x[rev];
size_t change = n ^ (n + 1);
#if _WIN64
rev ^= change << (__lzcnt64(change) - (64 - stages));
#else
rev ^= change << (__lzcnt(change) - (32 - stages));
#endif



On my PC, that reduces the time from around 2.8 million microseconds to 2.3 million microseconds.



This trick works by using that the XOR between adjacent indexes is a mask of ones up to and including the least significant zero (the +1 borrows through the least significant set bits and into that least significant zero), which has a form that can be reversed by just shifting it. The reversed mask is then the XOR between adjacent reversed indexes, so applying it to the current reversed index with XOR increments it.



__lzcnt64 and _WIN64 are for MSVC, you could use more preprocessor tricks to find the right intrinsic and bitness-detection for the current compiler. Leading zero count can be avoided by using std::bitset and its count method:



size_t change = n ^ (n + 1);
std::bitset<64> bits(~change);
rev ^= change << (bits.count() - (64 - stages));


count is recognized by GCC and Clang as an intrinsic for popcnt, but it seems not by MSVC, so it is not reliable for high performance scenarios.



Secondly, there is a repeated expression: W[n * W_offset] * X[k + n + N_stage / 2]. The compiler is often relied on to remove such duplication, but here it didn't happen. Factoring that out reduced the time to under 2 million microseconds.



Computing the twiddle factors takes a bit more time than it needs to. They are powers of the first non-trivial twiddle factor, and could be computed iteratively that way. This suffers from some build-up of inaccuracy, which could be improved by periodically resetting to the proper value computed by std::polar. For example,



auto twiddle_step = std::polar(1.0, -2.0 * M_PI / N);
auto twiddle_current = std::polar(1.0, 0.0);
for (size_t k = 0; k < N / 2; ++k)

if ((k & 0xFFF) == 0)
twiddle_current = std::polar(1.0, -2.0 * M_PI * k / N);
W[k] = twiddle_current;
twiddle_current *= twiddle_step;

// The N/2-point complex DFT uses only the even twiddle factors
if (k % 2 == 0)

W_p[k / 2] = W[k];




On my PC that reduces the time from hovering around 1.95 million µs to around 1.85 million µs, not a huge difference but easily measurable.



More advanced: use SSE3 for the main calculation, for example (not well tested, but seems to work so far)



__m128d w_real = _mm_set1_pd(W[n * W_offset].real());
__m128d w_imag = _mm_set1_pd(W[n * W_offset].imag());
__m128d z = _mm_loadu_pd(reinterpret_cast<double*>(&X[k + n + N_stage / 2]));
__m128d z_rev = _mm_shuffle_pd(z, z, 1);
__m128d t = _mm_addsub_pd(_mm_mul_pd(w_real, z), _mm_mul_pd(w_imag, z_rev));

__m128d x = _mm_loadu_pd(reinterpret_cast<double*>(&X[k + n]));
__m128d t1 = _mm_add_pd(x, t);
__m128d t2 = _mm_sub_pd(x, t);
_mm_storeu_pd(reinterpret_cast<double*>(&X[k + n]), t1);
_mm_storeu_pd(reinterpret_cast<double*>(&X[k + n + N_stage / 2]), t2);


That takes it from 1.85 million µs down to around 1.6 million µs on my PC.






share|improve this answer











$endgroup$



Putting this through the built-in profiler reveals some hot spots. Perhaps surprisingly: ReverseBits. It's not the biggest thing in the list, but it is significant while it shouldn't be.



You could use one of the many alternate ways to implement ReverseBits, or the sequence of bit-reversed indexes (which does not require reversing all the indexes), or the overall bit-reversal permutation (which does not require bit reversals).



For example here is a way to compute the sequence of bit-reversed indexes without explicitly reversing any index:



for (size_t n = 0, rev = 0; n < N; ++n)

X[n] = x[rev];
size_t change = n ^ (n + 1);
#if _WIN64
rev ^= change << (__lzcnt64(change) - (64 - stages));
#else
rev ^= change << (__lzcnt(change) - (32 - stages));
#endif



On my PC, that reduces the time from around 2.8 million microseconds to 2.3 million microseconds.



This trick works by using that the XOR between adjacent indexes is a mask of ones up to and including the least significant zero (the +1 borrows through the least significant set bits and into that least significant zero), which has a form that can be reversed by just shifting it. The reversed mask is then the XOR between adjacent reversed indexes, so applying it to the current reversed index with XOR increments it.



__lzcnt64 and _WIN64 are for MSVC, you could use more preprocessor tricks to find the right intrinsic and bitness-detection for the current compiler. Leading zero count can be avoided by using std::bitset and its count method:



size_t change = n ^ (n + 1);
std::bitset<64> bits(~change);
rev ^= change << (bits.count() - (64 - stages));


count is recognized by GCC and Clang as an intrinsic for popcnt, but it seems not by MSVC, so it is not reliable for high performance scenarios.



Secondly, there is a repeated expression: W[n * W_offset] * X[k + n + N_stage / 2]. The compiler is often relied on to remove such duplication, but here it didn't happen. Factoring that out reduced the time to under 2 million microseconds.



Computing the twiddle factors takes a bit more time than it needs to. They are powers of the first non-trivial twiddle factor, and could be computed iteratively that way. This suffers from some build-up of inaccuracy, which could be improved by periodically resetting to the proper value computed by std::polar. For example,



auto twiddle_step = std::polar(1.0, -2.0 * M_PI / N);
auto twiddle_current = std::polar(1.0, 0.0);
for (size_t k = 0; k < N / 2; ++k)

if ((k & 0xFFF) == 0)
twiddle_current = std::polar(1.0, -2.0 * M_PI * k / N);
W[k] = twiddle_current;
twiddle_current *= twiddle_step;

// The N/2-point complex DFT uses only the even twiddle factors
if (k % 2 == 0)

W_p[k / 2] = W[k];




On my PC that reduces the time from hovering around 1.95 million µs to around 1.85 million µs, not a huge difference but easily measurable.



More advanced: use SSE3 for the main calculation, for example (not well tested, but seems to work so far)



__m128d w_real = _mm_set1_pd(W[n * W_offset].real());
__m128d w_imag = _mm_set1_pd(W[n * W_offset].imag());
__m128d z = _mm_loadu_pd(reinterpret_cast<double*>(&X[k + n + N_stage / 2]));
__m128d z_rev = _mm_shuffle_pd(z, z, 1);
__m128d t = _mm_addsub_pd(_mm_mul_pd(w_real, z), _mm_mul_pd(w_imag, z_rev));

__m128d x = _mm_loadu_pd(reinterpret_cast<double*>(&X[k + n]));
__m128d t1 = _mm_add_pd(x, t);
__m128d t2 = _mm_sub_pd(x, t);
_mm_storeu_pd(reinterpret_cast<double*>(&X[k + n]), t1);
_mm_storeu_pd(reinterpret_cast<double*>(&X[k + n + N_stage / 2]), t2);


That takes it from 1.85 million µs down to around 1.6 million µs on my PC.







share|improve this answer














share|improve this answer



share|improve this answer








edited 53 mins ago

























answered 8 hours ago









haroldharold

2,2418 silver badges12 bronze badges




2,2418 silver badges12 bronze badges














  • $begingroup$
    Is there a way that doesn't involve preprocessor tricks to make this cross platform compatible? I'd prefer to avoid using the preprocessor although it's not completely out of the question.
    $endgroup$
    – tjwrona1992
    8 hours ago










  • $begingroup$
    @tjwrona1992 I added an alternative. You could try one of the other permutation algorithms to avoid this problem altogether.
    $endgroup$
    – harold
    8 hours ago










  • $begingroup$
    Thanks this all looks like very useful advice! I'll try to take advantage of some of it and let you know how it goes. Out of curiosity, how did you know the compiler didn't factor out the duplicated code?
    $endgroup$
    – tjwrona1992
    8 hours ago










  • $begingroup$
    @tjwrona1992 by proof-reading the assembly code, painful as it was
    $endgroup$
    – harold
    8 hours ago










  • $begingroup$
    Ouch... hahaha. You sir, are a boss.
    $endgroup$
    – tjwrona1992
    8 hours ago
















  • $begingroup$
    Is there a way that doesn't involve preprocessor tricks to make this cross platform compatible? I'd prefer to avoid using the preprocessor although it's not completely out of the question.
    $endgroup$
    – tjwrona1992
    8 hours ago










  • $begingroup$
    @tjwrona1992 I added an alternative. You could try one of the other permutation algorithms to avoid this problem altogether.
    $endgroup$
    – harold
    8 hours ago










  • $begingroup$
    Thanks this all looks like very useful advice! I'll try to take advantage of some of it and let you know how it goes. Out of curiosity, how did you know the compiler didn't factor out the duplicated code?
    $endgroup$
    – tjwrona1992
    8 hours ago










  • $begingroup$
    @tjwrona1992 by proof-reading the assembly code, painful as it was
    $endgroup$
    – harold
    8 hours ago










  • $begingroup$
    Ouch... hahaha. You sir, are a boss.
    $endgroup$
    – tjwrona1992
    8 hours ago















$begingroup$
Is there a way that doesn't involve preprocessor tricks to make this cross platform compatible? I'd prefer to avoid using the preprocessor although it's not completely out of the question.
$endgroup$
– tjwrona1992
8 hours ago




$begingroup$
Is there a way that doesn't involve preprocessor tricks to make this cross platform compatible? I'd prefer to avoid using the preprocessor although it's not completely out of the question.
$endgroup$
– tjwrona1992
8 hours ago












$begingroup$
@tjwrona1992 I added an alternative. You could try one of the other permutation algorithms to avoid this problem altogether.
$endgroup$
– harold
8 hours ago




$begingroup$
@tjwrona1992 I added an alternative. You could try one of the other permutation algorithms to avoid this problem altogether.
$endgroup$
– harold
8 hours ago












$begingroup$
Thanks this all looks like very useful advice! I'll try to take advantage of some of it and let you know how it goes. Out of curiosity, how did you know the compiler didn't factor out the duplicated code?
$endgroup$
– tjwrona1992
8 hours ago




$begingroup$
Thanks this all looks like very useful advice! I'll try to take advantage of some of it and let you know how it goes. Out of curiosity, how did you know the compiler didn't factor out the duplicated code?
$endgroup$
– tjwrona1992
8 hours ago












$begingroup$
@tjwrona1992 by proof-reading the assembly code, painful as it was
$endgroup$
– harold
8 hours ago




$begingroup$
@tjwrona1992 by proof-reading the assembly code, painful as it was
$endgroup$
– harold
8 hours ago












$begingroup$
Ouch... hahaha. You sir, are a boss.
$endgroup$
– tjwrona1992
8 hours ago




$begingroup$
Ouch... hahaha. You sir, are a boss.
$endgroup$
– tjwrona1992
8 hours ago













2












$begingroup$

You should be able to work inplace by utilizing the fact that std::complex has a predefined layout. Therefore you can actually cast between an array of double and an array of (half as many) complex numbers.



std::vector<std::complex<double>> a(10);
double *b = reinterpret_cast<double *>(a.data());


EDIT:



To be more clear I would write



span<std::complex<double>> x_p(reinterpret_cast<std::complex<double>*>(x.data()), x.size() / 2);


This works in both ways. To enable safe and modern features you should use a span object. Unfortunately std::span is only available in C++20 so you should either write your own (which is a nice exercise) or have a look at abseil::span or gsl::span.



The code to implement those is rather minimal. With that you can remove two copies from your code






share|improve this answer











$endgroup$














  • $begingroup$
    Are you saying replace vector with span? I haven't used span before. I'll look into it!
    $endgroup$
    – tjwrona1992
    8 hours ago










  • $begingroup$
    No, A span is non-owning. But rather than copying the data into x_p you should create a span that reinterprests the data in x as an array of std::complex
    $endgroup$
    – miscco
    7 hours ago










  • $begingroup$
    Ahh I see, maybe I will implement another one that does it in place like that. But I would also like to have the option to do it out of place as well
    $endgroup$
    – tjwrona1992
    3 hours ago










  • $begingroup$
    Do any compilers already support C++20? I didn't think it was officially released yet
    $endgroup$
    – tjwrona1992
    47 mins ago















2












$begingroup$

You should be able to work inplace by utilizing the fact that std::complex has a predefined layout. Therefore you can actually cast between an array of double and an array of (half as many) complex numbers.



std::vector<std::complex<double>> a(10);
double *b = reinterpret_cast<double *>(a.data());


EDIT:



To be more clear I would write



span<std::complex<double>> x_p(reinterpret_cast<std::complex<double>*>(x.data()), x.size() / 2);


This works in both ways. To enable safe and modern features you should use a span object. Unfortunately std::span is only available in C++20 so you should either write your own (which is a nice exercise) or have a look at abseil::span or gsl::span.



The code to implement those is rather minimal. With that you can remove two copies from your code






share|improve this answer











$endgroup$














edited 7 hours ago

answered 8 hours ago

miscco
3,634 reputation · 6 silver badges · 17 bronze badges
  • Are you saying replace vector with span? I haven't used span before. I'll look into it!
    – tjwrona1992
    8 hours ago










  • No, a span is non-owning. Rather than copying the data into x_p, you should create a span that reinterprets the data in x as an array of std::complex.
    – miscco
    7 hours ago










  • Ah, I see; maybe I'll implement another version that works in place like that. But I would also like the option to do it out of place as well.
    – tjwrona1992
    3 hours ago










  • Do any compilers already support C++20? I didn't think it was officially released yet.
    – tjwrona1992
    47 mins ago















