Don't get me wrong - I'm not saying the performance benefit isn't great; it is. What I mean is that the absolute speedup matters less than the relative comparison. At the same resolution Jim K. was operating at, the speedup ratio between Octave and Matlab is about 3x: Jim reported roughly a 121x speedup for Matlab, and for the same resolution I saw about ~360x for Octave.
Please see the graph in the link below that shows the time it takes for one of the vectorized filters to operate at various image sizes.
https://www.dpreview.com/forums/post/59077181
The prior art is written in a way that is difficult for the compiler to understand, or the compiler is not well implemented for speed.
Yes, that may be true. However, I don't think it is necessarily the code-generation part of the compiler; IMHO, it could be the parser. In fact, the timing graph in the link above suggests that Octave scales roughly linearly as image size increases, or at least not superlinearly. I would think that the parsing in the interpreter is what needs more optimization.
Many people don't seem to realize how much effective parsing methodology matters. That is why I'm a big fan of all CS and even EE students taking the compiler course, which unfortunately is optional at many universities. Not for the code-generation part per se, but for the parsing methodologies taught in that course using powerful tools such as bison/flex, etc. With the advent of big data, I feel there is a need for more effort on parsing methodologies for distributed workflows and data pipelines. However, many students don't realize that and in fact like to skip this very useful course because it is considered difficult.
On a different note, at its heart Octave, like many other open-source GPL'd packages, uses standard high-performance libraries such as FFTW3, LAPACK, etc. Hence, the many parts of its code that run at native speed on the hardware should be fast enough.
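The same division of labor shows up in Python: NumPy hands FFTs and linear algebra off to optimized native libraries, which is exactly what FFT-based filters exploit. A minimal sketch, assuming only NumPy (the names `x`, `k`, `n` are illustrative), of the convolution theorem that makes those native FFT calls pay off:

```python
import numpy as np

n = 256
rng = np.random.default_rng(0)
x = rng.random(n)  # a signal
k = rng.random(n)  # a filter kernel

# Direct circular convolution: O(n^2), evaluated in the interpreter.
direct = np.array([sum(x[j] * k[(i - j) % n] for j in range(n))
                   for i in range(n)])

# FFT-based circular convolution: O(n log n), runs inside the native FFT.
via_fft = np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(k)))

print(np.allclose(direct, via_fft))  # prints True
```

The two results agree to floating-point tolerance, but the FFT route spends nearly all of its time in compiled library code rather than in the interpreter.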
If Matlab is 5x faster than Octave (or so), that would suggest Matlab's compiler is a lot better than Octave's. That suggests Matlab > Octave for image processing, which is among the most time-demanding tasks in computing.
Matlab's parser is perhaps faster; not necessarily its code execution. I would expect a commercial company that has been in business for a long time to have a decent code-generating and optimizing compiler.
However, that is still not a reason to buy Matlab when Octave is free.

The case has to be more compelling.
Of course, Matlab is likely much slower than "real code." I once got a non-FFT, convolution-based image processing algorithm working in C#, verified results and all, which could process 4K monochromatic images almost fast enough for video (~25 ms/frame). It used SIMD plus multithreading. If I compiled the code and called the functions through Matlab's ability to hook into .dll files, it took over 40 s and each thread utilized only about 4% of its core. I am not sure how it is possible for it to perform that badly.
Interesting.
Years ago I learned Ruby, later Python. Freshmen in optics are required to take a Matlab course, so I learned Matlab. The language is brilliant for an engineer, as the syntax is fairly "free" and doesn't have much "programmy stuff" like the var keyword, etc.
These days I'm gravitating towards Python. Much of the parser-related slowness we are talking about here, due to Matlab/Octave being interpreter-like environments, might go away. And Python comes with other advantages as well. However, one issue with Python is that multi-threading is effectively broken: the global interpreter lock (GIL) keeps only one thread executing Python bytecode at a time. Yet many young programmers don't even realize what that means for them.
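To make the "broken multi-threading" point concrete, here is a small experiment using only the standard library (the function name `burn` and the size `N` are made up for illustration): two threads doing CPU-bound pure-Python work take about as long as doing the work twice serially, because the GIL serializes bytecode execution.

```python
import threading
import time

def burn(n):
    # CPU-bound pure-Python loop; holds the GIL while it runs.
    total = 0
    for i in range(n):
        total += i * i
    return total

N = 1_000_000

t0 = time.perf_counter()
burn(N)
burn(N)
serial = time.perf_counter() - t0

t0 = time.perf_counter()
threads = [threading.Thread(target=burn, args=(N,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.perf_counter() - t0

# With the GIL, the threaded version is typically no faster than serial.
print(f"serial: {serial:.2f}s  two threads: {threaded:.2f}s")
```

Threads still help for I/O-bound work, where the GIL is released while waiting; it is CPU-bound code like this that gains nothing.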
And, there are some things in Python just unique to that language that are very interesting for newer, cloud-based, distributed, modern data flow-based pipelines.
Another issue is intellectual property. With Matlab (and also R) the scripts themselves are open to anybody. There are ways to get around that, but they are more like hacks. Compiled Python code should obviate that issue.
If you want speed and don't need to share with those "non-programmers," you can always write in C, with possibly great pain, or C++, with a bit less pain.
I love C++. But there are not enough takers these days. And Qt is absolutely amazing.
These languages are the "1x [execution] time cost" tier in terms of performance. These days there are a number of newer entrants (Rust, Go, Scala) that compete with the "old hat" C# in the "4x time cost" tier. Google's V8 JavaScript engine is also phenomenally good, and has JavaScript running at speeds comparable to C#, at times even faster.
Yes, they did that to enable so-called "full stack" developers who work front end to back end entirely in JavaScript, so that they could do everything end to end in one language. But JavaScript is terrible as a language.
C# has a beautifully simple way to convert for() loops into Parallel.For loops that offers 80%+ of the performance of manually managing the threading, with nearly zero effort from the programmer. That's a huge win and lets you parallelize damn near everything. I am not so familiar with Rust, Go, or Scala, as they are mostly server languages.
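For comparison, the closest everyday Python analogue to Parallel.For is probably `concurrent.futures`: swapping a comprehension for `ProcessPoolExecutor.map` (processes rather than threads, to sidestep the GIL). A sketch under that assumption, with `work` as a made-up stand-in for the loop body:

```python
from concurrent.futures import ProcessPoolExecutor

def work(i):
    # Stand-in for a per-iteration body; any picklable function works.
    return i * i

if __name__ == "__main__":
    items = range(8)

    # Serial version.
    serial = [work(i) for i in items]

    # Parallel version: one line changes, much like C#'s Parallel.For.
    with ProcessPoolExecutor() as pool:
        parallel = list(pool.map(work, items))

    print(serial == parallel)  # prints True
```

The `__main__` guard matters: on platforms that spawn workers, the module is re-imported in each child process.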
C# is a reasonable language. However, IMHO, Python will eventually kill it.

Though, MS Visual Studio has a great IDE, one of the best I have used.
I do know that in JavaScript there are now ways to spin off more processes for parallelism, but it is a manual process.
They are trying. But you can't convert a hyena (JavaScript) into a lion (Python).
Manual multithreading is a pain, and many workloads are very difficult to multithread this way. In the median filter example, you need to send each worker a chunk of the image with some padding (the kernel radius), but have it process only the chunk without the padding. My eyes are rolling at the thought of writing that code.
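For what it's worth, the chunk-plus-padding dance can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not production code: `filter_chunk` and `parallel_median` are made-up names, the per-pixel `np.median` loop is slow, and a real version would hand each band to a separate worker (e.g. a process pool) instead of calling them in sequence.

```python
import numpy as np

def filter_chunk(chunk, r):
    """Median-filter the interior of a band that carries r rows of halo
    on top and bottom; return only the interior rows."""
    h, w = chunk.shape
    out = np.empty((h - 2 * r, w), dtype=chunk.dtype)
    for y in range(r, h - r):
        for x in range(w):
            # Clamp the window horizontally at the left/right edges.
            x0, x1 = max(0, x - r), min(w, x + r + 1)
            out[y - r, x] = np.median(chunk[y - r:y + r + 1, x0:x1])
    return out

def parallel_median(img, r, n_chunks):
    """Split img into horizontal bands, each padded with r halo rows.
    Each filter_chunk call is independent and could go to a worker."""
    padded = np.pad(img, ((r, r), (0, 0)), mode="edge")
    h = img.shape[0]
    bounds = np.linspace(0, h, n_chunks + 1, dtype=int)
    pieces = []
    for a, b in zip(bounds[:-1], bounds[1:]):
        # The band's own rows plus r rows of halo on each side.
        pieces.append(filter_chunk(padded[a:b + 2 * r], r))
    return np.vstack(pieces)
```

The key detail is exactly the eye-rolling part: each band is cut from a padded copy with `r` extra rows on each side, while `filter_chunk` writes out only the interior rows, so the seams between bands disappear.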
It is often said that the Matlab compiler takes advantage of SIMD for vectorized functions - see e.g.
http://stackoverflow.com/questions/12615309/how-does-matlab-vectorized-code-work-under-the-hood
You can expect between 4x and 64x of the improvement to be due to the compiler's ability to turn your code into SIMD instructions. Images are usually UInt8s, so I would expect up to 64x from SIMD alone (64 one-byte lanes in a 512-bit register).
Matlab and Octave, being very high-level languages, do not expose any sort of SIMD vs. non-SIMD choice to the programmer.
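The same gap is easy to reproduce from Python, where the split is identical: an interpreted per-element loop vs. one call into a native (and typically SIMD-vectorized) kernel. A rough sketch, with the sizes and names chosen arbitrarily:

```python
import time
import numpy as np

n = 100_000
rng = np.random.default_rng(1)
a = rng.integers(0, 255, n, dtype=np.uint8)
b = rng.integers(0, 255, n, dtype=np.uint8)

# Interpreted per-element loop.
t0 = time.perf_counter()
avg_loop = [(int(a[i]) + int(b[i])) // 2 for i in range(n)]
loop_time = time.perf_counter() - t0

# One vectorized kernel call; widen to uint16 to avoid uint8 overflow.
t0 = time.perf_counter()
avg_vec = (a.astype(np.uint16) + b) // 2
vec_time = time.perf_counter() - t0

print(f"loop: {loop_time:.4f}s  vectorized: {vec_time:.5f}s")
```

On typical hardware the vectorized line is orders of magnitude faster, which is the same effect the vectorized Matlab/Octave filters are cashing in on.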
What a lot of signal processing people often don't realize is that the world out there is not always amenable to parallel code. They are used to embarrassingly parallel algorithms that are easy to cast into parallel code. But given the issues in a real distributed computing environment, it is not always as easy to parallelize things as one might think.
And it is not just GPU computing that is blissfully accommodating of embarrassingly parallel code. Many standard workflows in distributed computing frameworks such as Hadoop and Spark are also geared toward such scenarios. Though for text-based inputs Spark is simply killing Hadoop (MapReduce, that is, not the Hadoop environment such as HDFS, etc.). But that is a different story.
Assuming you are working "purely" with variables in RAM, the response time of RAM is in the tens-of-nanoseconds domain. It is true that loading millions or billions of bytes can bring the total access time up into the millisecond domain, but 1 MP images are not large enough for the speed of RAM to be noticeable.
CPU caching (L1, L2 cache, etc., and their sizes) is also an important issue. That is where the data layouts I mentioned before (row- vs. column-based) can make a difference for some operations.
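A quick way to feel this from Python (a sketch; NumPy arrays are row-major by default, and the variable names are arbitrary): summing the same data row by row touches contiguous memory, while summing it column by column strides across it, so every access lands in a different cache line.

```python
import time
import numpy as np

n = 1500
img = np.arange(n * n, dtype=np.float64).reshape(n, n)  # row-major (C order)

t0 = time.perf_counter()
s_rows = 0.0
for y in range(n):
    s_rows += img[y].sum()      # each row is contiguous in memory
row_time = time.perf_counter() - t0

t0 = time.perf_counter()
s_cols = 0.0
for x in range(n):
    s_cols += img[:, x].sum()   # elements 8*n bytes apart: cache-unfriendly
col_time = time.perf_counter() - t0

print(f"rows: {row_time:.3f}s  cols: {col_time:.3f}s")
```

Both loops compute the same total; how much slower the column pass is depends on the cache sizes, which is exactly the point.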
For image processing, unless you need floats, I would say speed goes GPU >> SIMD+multithreaded C#/JS/Go/Rust/Scala >> SIMD Matlab > SIMD Octave (?) >> non-SIMD Matlab >> Octave
(>> being a ~5x speedup or more)
Now we also have to think in terms of the totality of distributed systems, which include GPUs, etc. as subcomponents of a larger system. The world has grown beyond computation in a single, isolated environment. See below.
I do wonder if it would actually be faster to implement expensive image processing algorithms in a fast, parallel+SIMD language and expose them behind a server.
Now we are talking

This is something I've been doing for some time.
If the requests don't leave the machine, they should be served (transfer included) in a few milliseconds. Then you could do these things very quickly, with the same accessibility as Matlab/Octave but all the speed benefits of faster, "more programmy" languages.
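As a proof of concept, the whole loop fits in the Python standard library. A sketch under obvious assumptions: `InvertHandler` is a made-up name, and the byte-inverting "filter" is a toy stand-in for a real image operation that a fast native backend would perform.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

class InvertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Toy "filter": invert the posted pixel bytes.
        body = self.rfile.read(int(self.headers["Content-Length"]))
        out = bytes(255 - b for b in body)
        self.send_response(200)
        self.send_header("Content-Length", str(len(out)))
        self.end_headers()
        self.wfile.write(out)

    def log_message(self, *args):
        pass  # keep the demo quiet

# Port 0 lets the OS pick a free port; serve in a background thread.
server = ThreadingHTTPServer(("127.0.0.1", 0), InvertHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_port}/"
pixels = bytes(range(256))
reply = urllib.request.urlopen(urllib.request.Request(url, data=pixels)).read()
server.shutdown()
server.server_close()
print(reply == bytes(255 - p for p in pixels))  # prints True
```

For same-machine requests like this, the round trip is dominated by the filter itself rather than the transport, which is what makes the server idea viable.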
A lot of secret sauce here. But you are thinking in the right direction.