Intelligently architecting the workload to take advantage of multicore (and other answers)
Jim St. Leger is the Platform Technology Marketing Manager in Intel’s Embedded and Communications Group (ECG). Recently Jim spoke with Joe Pavlat, CompactPCI AdvancedTCA & MicroTCA Systems Editorial Director. Joe is also the president of PICMG.
Editor’s Note: Readers will also find it helpful to review a white paper by Lori M. Matassa, Software Technical Marketing Engineer, Intel, on converting PowerPC to IA.
This architecture migration paper addresses the high-level software implications developers should consider when running on IA. As well, it covers software development tools, another topic of this interview.
Joe: How does hyperthreading work and how does it improve performance?
Jim: What hyperthreading does is perhaps no different from what other folks have done over the years to support processors: [it is] the ability to support multiple threads per physical core. From a hardware perspective, it becomes two logical cores. From the software perspective, it is two threads per core. Now you have the ability to do two threads’ worth of work on one physical unit. Then of course if you do a dual-socket implementation with multiple cores … you just do the multiplication going through.
Looking at our embedded road map, our business is very broad compared to traditional Intel groups: we have low-power Atom SKUs on the lowest end and very high-performance multisocket CPUs on the high end. Hyperthreading allows us to have a consistent multithreading, multicore offering across the board, because some of our Atom SKUs (the embedded ones) support hyperthreading as well. It is easy to give guidance to customers: “Either thread your code or architect for multiple cores to be able to tap into this.”
Joe: There is more and more being written about the diminishing returns of just adding cores and more threads to the processor – a lot of software does not know how to take advantage of that. I have seen articles in places like The Economist saying that after the fourth core, results tend to go down unless you really pay attention to your code. What are your thoughts about more than four cores? Is Intel planning for that? And what software (in particular OS and BIOS) are you going to provide to take advantage of these extra cores?
Jim: We have quad-core implementations with dual sockets supporting hyperthreading, so you can end up with 16 threads on that platform. You have to be a little careful as to whether it is the maximum number of cores or the maximum number of threads you are talking about, because at the end of the day it is the software that makes it work. When you are designing the board you are just laying it out. The real secret is tapping into the performance, and this is extremely important on multicore platforms. We abandoned the gigahertz race and are now chasing the core-count/thread-count race. It is all about architecting your platform in an intelligent way based on your workload, and this is even more crucial in embedded than anywhere else.
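The multiplication Jim describes is straightforward; a minimal sketch (the function name is ours, not an Intel API) reproduces the dual-socket, quad-core, hyperthreaded example from the text:

```python
def hardware_threads(sockets: int, cores_per_socket: int, threads_per_core: int) -> int:
    """Total hardware threads visible to software: sockets x cores x threads per core."""
    return sockets * cores_per_socket * threads_per_core

# The dual-socket, quad-core, hyperthreaded platform from the interview:
print(hardware_threads(sockets=2, cores_per_socket=4, threads_per_core=2))  # 16
```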
[With regard to] your comment about diminishing returns – I think if your approach is: “I am just going to add more and more and more threads to my workload and throw more cores at it,” you perhaps would get into a law of diminishing returns.
It is more about intelligently architecting your workload: thinking about whether it lends itself to data parallelism and whether you have granularity issues. When you go from single-threaded to two threads, will each thread have the same amount of work to send to two different cores? It is in thinking through those questions that an intelligent architecture can be determined.
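The granularity point – giving each thread the same amount of work so neither core sits idle – can be sketched with a simple even-split helper (the function is our illustration, not from the interview):

```python
def split_evenly(items, n_chunks):
    """Split a work list into n_chunks whose sizes differ by at most one,
    so each thread receives roughly the same amount of work."""
    k, r = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + k + (1 if i < r else 0)  # first r chunks take one extra item
        chunks.append(items[start:end])
        start = end
    return chunks

print([len(c) for c in split_evenly(list(range(10)), 2)])  # [5, 5]
print([len(c) for c in split_evenly(list(range(10)), 3)])  # [4, 3, 3]
```

Handing chunks like these to a thread pool keeps per-thread work balanced, which is the property Jim is after when moving from one thread to two.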
There is an open source security application called SNORT, and [Intel, doing some internal work] took that single-threaded application and laid it out on a quad-core platform. All they did was thread it, and guess what – the results were, I think, only about 10 percent better, a relatively small number. And then they asked why. The answer was that the bottleneck [did not occur in] threading the work but [occurred in] the packet capture and [in the task of] distributing those packets across the other cores. So then they dedicated one core to packet capture and that library and the other three cores to actually doing the work, and the performance went up X amount.
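The restructuring described – one dedicated capture stage feeding three worker cores – is a classic producer/consumer pipeline. A toy Python sketch (function names and the trivial "inspection" are our invention, not the SNORT code) shows the shape:

```python
import queue
import threading

NUM_WORKERS = 3   # the "other three cores" doing the actual work
SENTINEL = None   # tells each worker to shut down

def capture_packets(q, packets):
    """Dedicated capture stage: one thread pulls packets and distributes them."""
    for pkt in packets:
        q.put(pkt)
    for _ in range(NUM_WORKERS):
        q.put(SENTINEL)

def inspect(q, results, lock):
    """Worker stage: each thread inspects packets independently."""
    while True:
        pkt = q.get()
        if pkt is SENTINEL:
            break
        with lock:
            results.append(len(pkt))  # placeholder for real inspection work

packets = [b"abc", b"defg", b"hi"]
q, results, lock = queue.Queue(), [], threading.Lock()
workers = [threading.Thread(target=inspect, args=(q, results, lock)) for _ in range(NUM_WORKERS)]
producer = threading.Thread(target=capture_packets, args=(q, packets))
for t in workers:
    t.start()
producer.start()
producer.join()
for t in workers:
    t.join()
print(sorted(results))  # [2, 3, 4]
```

Because capture never competes with inspection for a core, the serial bottleneck Jim mentions is isolated to one stage and the rest scales with worker count.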
That is the kind of approach you need to apply, and when you do, that is where I think you can continue the scaling. Now, whether it is 1:1:1 [one-to-one-to-one] – add cores, add threads, and your scaling goes up by that amount – that is utopian. I have seen results on certain applications that get very close to that, so I think it is possible, but it has to be intelligently done: first from high-level software architectural work, and then from applying the tools to make the results scale to the point you need them.
And the last point is the software tools, which are the bread and butter. If you have been following the recent press release from our Software and Services business group about Intel Parallel Studio, clearly the initial instantiation of that is targeted at mainstream PC developers.
Many Windows applications could take advantage of a tool chain that is optimized from its inception for multithreading and multicore.
Joe: What tools are you providing and planning on providing to help software engineers really take advantage of multithreading?
Jim: The easiest ones to adopt are our threaded, optimized libraries. Our Software and Services group has spent a tremendous amount of time taking our libraries – which are effectively plug-in elements or routines for developers – and saying, “I am going to do this work the best that I can for multicore and thread this routine,” so developers can quickly plug them into an application today.
Beyond that, it is optimizing VTune, Thread Checker, and Thread Profiler to check for race conditions and deadlock conditions, and using VTune as an analyzer to [determine] where you want to insert parallelism into your code, so you then know where to start adding threads and workloads and things of that nature.
Joe: How does Intel VT improve virtualization?
Jim: Virtualization is clearly not a new technology. IBM and others have had it around for maybe 30 or 40 years. What has made it different now is twofold. One is what I will call bringing virtualization to the masses – let’s call it the large compute environment, the Intel x86 architecture – and that is where folks like VMware are popularizing the concept from a server-enterprise infrastructure perspective.
In our world it is all about creating the optimal minimized platform. I need a certain level of performance, but I do not necessarily need 10x that capability; I need a certain level to hit. I need a certain power envelope, a certain physical size, and of course a certain cost structure.
And when you look at taking a technology that was developed for the enterprise space and applying it to embedded, it is very tough to fit that square peg in a round hole.
So then what happened is you had companies like Intel coming out with hardware-assisted virtualization. We branded it Intel Virtualization Technology, or VT, and what it does is provide in hardware some of the functional paths that previously had to be done in software in the Virtual Machine Monitor (VMM). [In the embedded space virtual machine monitors are more typically called hypervisors.]
And it is taking some of those functions that cause high software overhead and doing them in silicon. For example, the fundamental core piece of this is processor virtualization, which in the Intel VT family is called VT-x. That creates a new operating mode called VT root mode. The VMM now has a higher level of privilege and operates in that mode, while your operating systems and applications run in the less-privileged ring 0 and ring 3. You have created a tiered structure so that your VMM can exist in this more privileged mode and tap into the hardware assist through VT-x (and there are other technologies, such as VT-d for Directed I/O and VT-c for connectivity), which really help create a virtualized platform.
Because it is not just about processor virtualization: there is memory virtualization, there is I/O virtualization and direct assignment, and of course there is device and Ethernet virtualization. How do you share all those elements to create, in the end, a solution that is doing something, as opposed to just a server that is crunching numbers?
The net of it all is that virtualization technology is accelerating some of the VMM functions and tasks in silicon to reduce the overhead, so you have a higher-performing system.
In embedded you get into thin hypervisors from folks like VirtualLogix and LynuxWorks and others to enable it for embedded: to maintain determinism and run an RTOS on one side and Linux or Windows in another VM.
Joe: Intel talks about three key vectors – performance, scalability, and low power. Could you comment on the challenges in each of those areas?
Jim: Across all of them there are always some power aspects, but in the low-power segment the constraint is extreme. (“Low power” is relative; it depends on one’s perspective – AdvancedTCA and PC/104, for example, have very different perspectives.)
For us the low-power challenge is maximizing the performance in a very constrained power situation, which is for sure under 10 W, likely under 5 W, and some folks would argue it is under 2 W.
But let’s take the sub-10 W sector: that comprehensive envelope covers your entire platform – processor, chipset, memory, and other devices. How are you going to fit all those things into that power envelope while continuing to offer more performance for a customer who is looking to do more and more? You are typically adding more connectivity to those platforms. Then you get into management of those platforms and other challenges. For us that is a great place to say, “We can apply our manufacturing excellence and the fact that we are always on the leading edge of manufacturing technology.” We moved to 45 nm; next year we are going to move to 32 nm.
I think you have seen our data on what is called the tick-tock: every other year we shrink the manufacturing process, then we update the microarchitecture, then we shrink the manufacturing process, and so on. [This approach] really helps us meet the challenge of these deeply embedded low-power applications.
Joe: As you say everyone wants more and more memory and more and more performance, and removing heat is getting to be a major obstacle.
Jim: We are always working very hard to make sure that we can take our highest-performance solutions and processors and fit them into applications like AdvancedTCA. Now that almost always requires us to do some special and unique things, which we absolutely go and do because it is a very critical and important market for us.
[For example] we recognized that fully buffered DIMMs had a value proposition for the server market, but they also had a power/thermal penalty, and we created a chipset to offer standard DIMMs in place of fully buffered DIMMs.
What we have also done over the years – and it is no different today on our Nehalem-based Xeon 5500 series platform – is create additional processors that take advantage of lower power levels. The L5508 and L5518 are targeted at the 38 W and 50 W processor levels, respectively, while also delivering extremely high performance. Fitting into the constrained envelope of a bladed server like AdvancedTCA or some of the proprietary servers presents a different performance-per-watt metric than [that presented by a] COM Express board.
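Performance per watt is just a ratio, but it is worth making the comparison explicit. A short sketch – the benchmark scores below are made-up illustrative numbers, not published results; only the 38 W and 80 W TDP figures come from the interview:

```python
def perf_per_watt(benchmark_score: float, tdp_watts: float) -> float:
    """Performance-per-watt metric: higher is better for power-constrained blades."""
    return benchmark_score / tdp_watts

# Hypothetical scores for a 38 W low-voltage part vs. an 80 W standard part.
low_power = perf_per_watt(benchmark_score=310.0, tdp_watts=38.0)   # ~8.2 per watt
standard  = perf_per_watt(benchmark_score=500.0, tdp_watts=80.0)   # ~6.3 per watt
print(low_power > standard)  # True
```

Even though the 80 W part posts the higher absolute score, the low-voltage part can win on this metric, which is why a thermally constrained AdvancedTCA blade and a COM Express board may choose different processors.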
Joe: There is an initiative to extend AdvancedTCA beyond the Central Office. We are trying to get to the 800 W level per double-wide board, and that has implications in many markets.
Jim: What is interesting is that on one side you guys are trying to figure out how to fit more in – more performance while dealing with the heat loads and power loads. From our side that will allow us to fit into different places, and maybe somebody will say, “Hey, I like that approach, because now I can take maybe a Xeon 5540 processor that is an 80 W processor, but I can fit it in there, and it has a different value proposition, higher clock speed, and the like.”
At the same time, there are always going to be folks who put a premium on peeling that down to the 40 or 50 W range to try to fit it into CompactPCI.
Joe: Many architectures are performance limited because they are thermally limited. What is Intel doing to improve performance per watt? Because at the end of the day that is the most important metric.
Jim: The other element of high performance is the system perspective. I have largely focused on just the processor, but if you look at our Xeon 5500 product platform, it has a couple of extremely innovative new features. Number one, it integrates the memory controller. So now you are not only eliminating a component and its footprint, you are also reducing latencies at the same time, so the performance from a platform perspective [and] from a latency perspective goes up.
Number two, we add the QuickPath Interconnect: point-to-point links between the processors and between the processors and the platform controller, offered as a higher-performance element. Now you have sort of a mesh network on a multicore/multisocket platform, which supports higher I/O and lower latency. Those are different elements from pure processing performance – it is more system throughput performance.
At the end of the day, that is what most customers are trying to do. They are processing some workload, they are moving data in and out, and they want to do both of those faster.
We’ve gone to three-channel memory and the dual-memory architecture, which also has advantages as people move to 64-bit operating systems and look for more performance. Integrating those kinds of features and functionality into a platform is going to continue to add performance vectors.
TurboBoost [for example] ties directly into the performance aspect and the question of what you are doing on the performance side. If you got people to sit down and think about it, they would realize that for some applications they may not have an endless stream of workloads and threads coming in to occupy every core nonstop.
Joe: Slow down and cool down.
Jim: So if you have a couple [of cores] that are not running full out, you can overclock the others. TurboBoost technology is designed to do that, and it works on two functionalities. One is: if the system monitoring features realize, “Hey, you are not at the extreme limit of the platform safety levels,” you can overclock all four of the cores in a quad-core system.
The other is: in a scenario where two or three cores are idle, you can overclock the working cores to an even higher level. So it is really intelligent scaling from a clock perspective for more performance, as opposed to power savings.
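The two behaviors can be captured in a toy model – this is our sketch, not Intel's actual algorithm, and the base frequency and bin size below are made-up values:

```python
def boosted_frequency(base_ghz: float, bin_ghz: float,
                      active_cores: int, total_cores: int,
                      within_limits: bool) -> float:
    """Toy model of the two TurboBoost behaviors described in the interview:
    a modest all-core boost when the platform is within its power/thermal
    limits, and a larger boost on the remaining cores when some are idle.
    Frequencies and bin sizes here are illustrative, not real product specs."""
    if not within_limits:
        return base_ghz                       # no headroom: stay at base clock
    idle = total_cores - active_cores
    return base_ghz + bin_ghz * (1 + idle)    # more idle cores -> more headroom

# All four cores active but within limits: one boost bin for every core.
print(boosted_frequency(2.0, 0.133, active_cores=4, total_cores=4, within_limits=True))
# Two cores idle: the two working cores climb higher still.
print(boosted_frequency(2.0, 0.133, active_cores=2, total_cores=4, within_limits=True))
```

The key design point Jim makes is the inversion of SpeedStep: instead of stepping down to save power, the clock steps up as long as monitored power and thermal headroom allow it.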
The counterpart to this is SpeedStep technology, which scales down through processor speed states to save more power; TurboBoost scales things up based on the availability of power headroom.
Joe: There is a big push toward adopting the IEEE 802.3ap 10-Gbps-per-pair 10GBASE-KR standard. We are working hard in PICMG to take the channel bandwidth from 10 Gbps to 40 Gbps. We have issues about budgets: How much of the loss budget does the backplane get, how much does the connector get, how much does the board get? This is something that the IEEE did not solve, which we need to solve for interoperability. When is 10-Gbps-per-pair silicon going to be available?
Jim: 10 gig is clearly on our radar screen and on our road map – it is a technology that we know the industry is starting to broadly adopt. It is funny in some ways that I can think back five or six years ago to having 10 gig discussions with the usual suspects in PICMG, the larger systems providers, and they all wanted to be on the forefront of 10 gig, but it seemed to be something that was not quite there yet from a demand perspective. What is now happening that will accelerate that [demand] is, of course, that multigigabit solutions have been out there for a long time and people are looking to move to the next level. And the enterprise infrastructures are rapidly moving there as well.
From our perspective, our Ethernet business will certainly be there; we want to maintain our strong position in that space.
Joe: Everybody is starting to talk about when we will have 40 gig channels. IPTV and video on demand are going to drive a lot of that.
Jim: And those are the applications we are looking at. So surely, from the standpoint of something being considered a requirement for an application, we think it is a good fit for us – we want to make sure we have solutions.
Jim St. Leger is the Platform Technology Marketing Manager in Intel’s Embedded and Communications Group (ECG). He manages a virtual team focused on the embedded industry’s multicore processor adoption and utilization of platform technologies. He previously was the Marketing Director of the Intel Embedded and Communications Alliance. Jim joined Intel in 1999 in the Embedded Intel Architecture division, managing the Applied Computing Platform Provider program, a predecessor to the Alliance. He has also held positions managing system and solutions providers and has extensive experience in collaborative engagements with third-party vendors. Jim holds an undergraduate degree in mechanical engineering from Rensselaer Polytechnic Institute and an MBA and Master of Engineering Management from Northwestern University’s Kellogg School of Management and McCormick School of Engineering, respectively.