I just attended a session at SD2008 on parallelism put on by an Intel engineer this week. The topic was increasing executable performance by utilizing threading and Intel's Threading Building Blocks (TBB) library. TBB offers template-style constructs to easily utilize the CPU's multiple cores. Unfortunately, Intel's focus is on C++ and Fortran (statically typed languages). Pondering execution speed, I was thinking that Python programs could sure use a bump. Does anyone know whether Python 2.5 or IronPython makes use of the parallel nature of the newer multi-core processors when threading is used? What about utilizing multiple cores even in single-threaded programs?
After jotting this down, I did a search to see if anyone else had posed the same question. This is what I found:
I had done some embedding of the Python interpreter into a C++ executable before and had to learn about the GIL. Python's threads are real OS threads, but each one must acquire the Global Interpreter Lock before it can execute Python bytecode: a thread grabs the lock and runs its allotted time while blocking all the others. The effect is that of a single thread being time-shared. Adding CPUs won't help a CPU-bound program's performance because only one core will be utilized while all the others are idle.
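For what it's worth, here is a sketch of my own of the usual workaround for CPU-bound code: the multiprocessing module (which landed in Python 2.6, after the version I mention above) runs work in separate OS processes, each with its own interpreter and its own GIL, so the extra cores actually get used.

```python
# Sidestepping the GIL with processes instead of threads: each worker
# process has its own interpreter and its own GIL.
import multiprocessing

def count_down(n):
    # CPU-bound busy work; with threads, this would serialize on the GIL
    while n > 0:
        n -= 1
    return n

def parallel_countdown(n, workers=4):
    # The "fork" context keeps the workers from re-importing this module
    # (Unix-specific; the default on Windows is "spawn")
    ctx = multiprocessing.get_context("fork")
    with ctx.Pool(processes=workers) as pool:
        # Each call to count_down runs in its own process, on its own core
        return pool.map(count_down, [n] * workers)
```

The same `pool.map` call with a thread pool would make no headway on multiple cores, which is exactly the limitation described above.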
Anyone have further thoughts or links to share on this topic?
Linux sockets (and stream sockets generally) have this peculiarity that I have run across while developing some protocols. Say you have set up a server which sends replies to client requests. The client connects and would like to read N bytes from the server. Those N bytes may arrive broken up across several recv() calls. Effectively, recv() must be called in a loop until the number of bytes you expect has been received.
The same occurs when you try to send() something: you have to loop until all the bytes you meant to send have actually gone out. I don't know about other people, but I find this very counter-intuitive, and having to write the same sort of code again and again bugs my brain.
If the only language a person uses is C or C++, then reusing a library won't be a problem, but try developing sockets in another language... say Python. You have to deal with the same weirdness in behavior once again. The pity is that the weirdness is not encapsulated; it leaks into your code each and every time you need to reimplement your protocol.
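One way to stop rewriting those loops is to encapsulate the weirdness once. A sketch of my own (note that Python's socket objects do ship a sendall() covering the send side, but there is no recv-exactly counterpart):

```python
import socket

def recv_exactly(sock, n):
    """Keep calling recv() until exactly n bytes have arrived."""
    chunks = []
    remaining = n
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:  # peer closed the connection before n bytes came
            raise ConnectionError("socket closed before %d bytes read" % n)
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)

def send_all(sock, data):
    """Keep calling send() until every byte has gone out."""
    total = 0
    while total < len(data):
        sent = sock.send(data[total:])
        if sent == 0:
            raise ConnectionError("socket connection broken")
        total += sent
```

With these two helpers in one place, a protocol implementation can just say "read N bytes" and move on.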
I am expecting to do lots of socket programming to implement some new protocols in the near future. Having seen the ugliness of the C++ code, I have chosen to use the Python batteries (i.e. socket and thread classes built into the python library) to lessen the amount of code that would have to be written. After a couple of days at it, the code is flowing decently.
Today, I was perusing an O'Reilly book at Barnes and Noble about the Twisted framework. I have looked at Twisted before but never really paid close attention. After perusing the book, I still have that question on my mind: what does Twisted give us that we can't already get from the Python library? FTP, sockets, threading, NNTP, POP3, etc.: every protocol that Twisted provides seems to be available as some class already existing in the library. The example programs given (written against the framework) were about the same length as when written with the Python libraries.
Again, why Twisted?
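For what it's worth, the usual answer is the event loop: Twisted's reactor multiplexes many connections in a single thread, instead of the thread-per-connection style the stdlib classes push you toward. Here is a rough sketch of that idea using only the standard library (the names and structure are mine, not Twisted's API):

```python
# A tiny "reactor": one thread watches every socket and dispatches a
# callback when a socket becomes readable.
import selectors
import socket

def make_echo_server(sel, host="127.0.0.1", port=0):
    server = socket.socket()
    server.bind((host, port))
    server.listen()
    server.setblocking(False)

    def on_readable(conn):
        data = conn.recv(1024)
        if data:
            conn.sendall(data)       # echo the bytes back
        else:
            sel.unregister(conn)     # peer hung up
            conn.close()

    def on_accept(srv):
        conn, _addr = srv.accept()
        conn.setblocking(False)
        sel.register(conn, selectors.EVENT_READ, on_readable)

    sel.register(server, selectors.EVENT_READ, on_accept)
    return server.getsockname()[1]   # the port the OS picked

def run_once(sel, timeout=1.0):
    # One turn of the event loop: invoke the callback for each ready socket
    for key, _mask in sel.select(timeout):
        key.data(key.fileobj)
```

Twisted wraps this pattern (plus protocol parsing, deferreds, and so on) so you never write the select loop yourself; that, rather than the individual protocol classes, is arguably its value.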
The recipe book at ActiveState has these recipes which look like they may be useful at one time or another:
* Pyro - Python remote object invocation
* Python Recipe for reading Excel tabular data
* Python Recipe: only on change decorator
* Python Recipe: run once decorator
* Python Recipe: Communicate between processes using mmap
Lastly, there is a free .NET interoperability book:
I remember my first computer modem that opened me up to the connected world. It was 300 baud and one of the few bulletin boards which I frequented often was called “The Shark’s Head”. This was in the San Jose area (making sure calls were local was extremely important back then). The Shark’s Head was a place where amateur writers, poets, or poet-wanna-bes like me could share our writing. The technology we used back then was so arcane but the community and the content we produced was extremely rich.
Fast-forward to today. I am helping build and improve Cisco's routers. 100 Mbit connections are the lowest Ethernet speed the machine supports; in the middle are 1 Gigabit and 10 Gigabit fiber connections. The sheer number of packets which flow here boggles the human imagination. I typically use Ixia traffic-generating equipment to pump data into the routers for testing, usually at 1 Gigabit. Can you imagine the sheer amount of data flying by? In humorous moments, I equate this to shovelling Ethernet packets.
Yet with all the vast amounts of bits flying all over the world today (as compared to 10 years ago), have we made better use of our connectivity? My observation says no. In fact, the quality of the content we have in the virtual world is diminishing every day. There is just so much more junk floating around. I look at my email filter statistics and see that 80% of the mail I receive is spam, and it is only increasing as time passes. Are we in the midst of an internet excess?
I have run into a problem similar to the default-allocation waste encountered with hard disks. However, since I am currently working in the data networking arena, that is where it occurs.
First, let me review the disk-allocation waste situation for those who aren't familiar. When you format your hard disk, you must choose a specific cluster size: the smallest unit of disk space the filesystem will allocate or refer to. A large cluster size is beneficial for large files because the computer doesn't have to index as many blocks when accessing such a file; with a small cluster size, there are many more index entries just to keep track of the same file. But the biggest drawback is that even a very small file still gets a whole cluster allocated to contain it. The difference between that small file's size and your cluster size is waste that cannot be used.
A prime example of this is when that very big disk you just bought for your Windows 98 machine suddenly runs out of space. You didn't actually use up all the space; your disk defaulted to a large cluster size, and much of your disk space is tied up in the unused portions of your disk clusters. I have written about Hard Disk Sizes before.
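The slack is easy to put in numbers. A back-of-the-envelope sketch (the figures are mine, chosen to match the old FAT defaults):

```python
def slack(file_size, cluster_size):
    """Bytes allocated but unused for one file: the last cluster is
    only partially filled unless the file size is an exact multiple."""
    used = file_size % cluster_size
    return 0 if used == 0 else cluster_size - used

# A 1 KB file on a volume with 32 KB clusters wastes 31 KB:
print(slack(1024, 32 * 1024))   # 31744 bytes of slack
# The same file with 4 KB clusters wastes only 3 KB:
print(slack(1024, 4 * 1024))    # 3072 bytes of slack
```

Multiply that per-file slack by tens of thousands of small files and the "missing" disk space accounts for itself.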
In networking, it is similar. A small "cluster" optimizes usage of pipe bandwidth, but large packets are simply not accepted by your switch or router; they are dropped because they don't fit in the cluster. With a large "cluster", jumbo packets are routed just fine; however, there is a large amount of waste whenever you are sending very small packets. Throughput is sub-optimal and can degrade by as much as 35%. Again, the reason is that the difference between the frame size and the cluster size is essentially wasted.
The two situations above are exactly the same problem in entirely different domains. In truth, you can't win them all: you choose a cluster size somewhere in the middle and hope that your usage falls somewhere near it. There will be benefits and waste, but like many things in life, they are out of your control.
Publicly accessible Cisco Documentation
I have been fixing a bug that involves writing to memories which take a long time to accept the charge. This used to be very common when FLASH memories were first introduced; now we see it in specialized ASICs.
Writing a value to an area of buffer memory which behaves differently than the rest of DRAM or SRAM involves tricking, or faking out, the processor. A CPU typically sets a timer before it does a write to memory; if the timer expires and the CPU hasn't received a DTACK from the memory, a processor exception occurs. In our case, DTACK would not arrive for hundreds of milliseconds.
In order to prevent the exception, the CPU must turn off the DTACK timer just before it proceeds to do the WRITEs, and in the same manner it must remember to re-enable the timer after such accesses are done.
From a software perspective, each of the long WRITEs looks like jumping into a black hole. Microsoft has solved this problem in Win32 with a mechanism called Asynchronous I/O: the software hooks up a worker function and a callback function and goes on its merry way; when the worker finishes, the callback is called and the software knows the operation has completed. In firmware, we don't have any of those particular luxuries; printf is about the best luxury we can afford. How do we wait for I/O completion? What you (or I) tend to find in a lot of firmware is fixed "for xxx to yyy" loops. Nothing fancy. The processor doesn't do anything but spin until the time it thinks the memory WRITEs are done.
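The sequence described above can be sketched like this (in Python for brevity; the real code would be C poking hardware registers, and every name here is invented for illustration):

```python
def slow_write(bus, addr, value, settle_loops=100_000):
    """Mirror the firmware sequence: mask the DTACK timeout, do the
    WRITE, spin a fixed loop while the part takes the charge, then
    restore the timeout."""
    bus.disable_dtack_timer()
    bus.write(addr, value)
    for _ in range(settle_loops):    # the fixed "for xxx to yyy" spin
        pass
    bus.enable_dtack_timer()

class FakeBus:
    """Stand-in for the real hardware, just to show the call order."""
    def __init__(self):
        self.log = []
    def disable_dtack_timer(self):
        self.log.append("timer off")
    def write(self, addr, value):
        self.log.append(("write", addr, value))
    def enable_dtack_timer(self):
        self.log.append("timer on")
```

The important invariant is the bracketing: the timer must be off before the WRITE and back on after the spin, or the next normal access inherits a disabled watchdog.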
We see a lot of silly code in firmware. However, if it gets the job done... what more can you want?