The IT troubleshooter drops his pack into a chair across the table from me and sits down. “We take care of…problems,” he says. The problems he’s referring to are sluggish EDA tools. Perhaps you’ve encountered a few. Perhaps you suspected the performance problems were the fault of the tools. Sometimes they are, but more often they’re not. It’s the infrastructure running the tools. Peter Vincent’s mission is to debug IT infrastructures that inhibit EDA tool performance, and he has some free tips for you if you need help in this department.
Vincent says that the three biggest problems are:
- Linux – Especially software revisions and configuration problems
- Storage – Not all storage performs equally well, and EDA hammers storage hard enough to surface any weaknesses in the subsystem
- Network – Typically the “elephant in the room” says Vincent.
He starts by talking about network problems. Today, says Vincent, many companies outsource IT so thoroughly that there’s often no one on site to actively monitor network performance. Often, companies aren’t even aware that they have network problems and outsourced IT support is frequently passive; it responds only to trouble calls.
What you want, says Vincent, is active network monitoring to discover network problems quickly…perhaps even before the problem is noticeable. For example, Vincent cites one case where a company had more than 1000 active Ethernet ports in its network and there were transmission errors (excessive collisions, packet discards) on more than 20% of these ports. You want to guess whether that might cause a discernible network performance problem? Of course it does.
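The kind of active monitoring Vincent describes boils down to watching per-port error counters and flagging the outliers. Here is a minimal sketch in Python; the port names, counter values, and zero-tolerance threshold are illustrative assumptions (in practice the counters would come from the switch, e.g. via SNMP), not data from Vincent’s engagement.

```python
# Sketch: flag Ethernet ports whose error counters suggest trouble.
# Port names, sample counters, and the threshold are illustrative only;
# real counters would be pulled from the switch (e.g. via SNMP polling).

def flag_noisy_ports(port_counters, error_threshold=0):
    """Return ports whose receive errors plus discards exceed the threshold."""
    noisy = []
    for port, counters in sorted(port_counters.items()):
        errors = counters.get("rx_errors", 0) + counters.get("discards", 0)
        if errors > error_threshold:
            noisy.append(port)
    return noisy

sample = {
    "gi1/0/1": {"rx_errors": 0, "discards": 0},
    "gi1/0/2": {"rx_errors": 1520, "discards": 44},  # a clear problem port
    "gi1/0/3": {"rx_errors": 0, "discards": 9},      # also worth a look
}
print(flag_noisy_ports(sample))  # ['gi1/0/2', 'gi1/0/3']
```

Run hourly across all thousand-plus ports and trended over time, a check like this surfaces the 20%-of-ports problem before designers ever file a trouble ticket.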
Often, says Vincent, these kinds of problems are caused by misconfigured ports. “How can they be misconfigured?” I ask. “Isn’t port configuration usually automatic?” I’m thinking of my own experiences where I plug the Cat-5 cables into the PC and router and “things just work.” Ah, says Vincent, some installers turn autoconfiguration off because they “know better.” Sometimes, even autonegotiation fails. You need to check the port to make sure it’s properly configured.
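Checking a port “to make sure it’s properly configured” can itself be scripted. The sketch below assumes the negotiated settings have already been parsed (e.g. from `ethtool` output on a Linux host) into a dictionary; the expected speed/duplex values and the sample port are made up for illustration.

```python
# Sketch: sanity-check a port's negotiated settings against expectations.
# The expected values and the sample port below are assumptions for
# illustration; settings would normally be parsed from `ethtool` output.

def check_port(settings, expected_speed="1000Mb/s", expected_duplex="Full"):
    """Return a list of human-readable problems with a port's configuration."""
    problems = []
    if settings.get("Auto-negotiation") != "on":
        problems.append("autonegotiation disabled")
    if settings.get("Speed") != expected_speed:
        problems.append(f"speed is {settings.get('Speed')}, expected {expected_speed}")
    if settings.get("Duplex") != expected_duplex:
        problems.append(f"duplex is {settings.get('Duplex')}, expected {expected_duplex}")
    return problems

# A port an installer "fixed" by hand: forced to 100 Mb, half duplex.
bad_port = {"Speed": "100Mb/s", "Duplex": "Half", "Auto-negotiation": "off"}
print(check_port(bad_port))
```

A port forced to half duplex on a full-duplex link is a classic source of exactly the collision counts described above, which is why the autonegotiation check comes first.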
Vincent cites another case where a company had a modern, fast, 10Gbps network at its facility but 2Mbps pipes to remote offices running thin clients. You think that might cause a problem? Of course it does. The size of the problem depends on the task being performed. In Vincent’s experience, thin EDA clients can work—to a point. “However, it’s not for analog design or polygon pushing,” he says. The classic rule of thumb is that thin EDA clients work if there’s less than 100 msec of latency in the connection. For layout tasks, make that less than 50 msec. Otherwise, the EDA tools will not feel appropriately interactive and you’ll get cranky designers.
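The rule of thumb above is simple enough to encode directly. The thresholds (100 msec general, 50 msec for layout) come from the article; the helper function itself is just an illustrative sketch.

```python
# Sketch of the latency rule of thumb: thin EDA clients are workable under
# ~100 ms of connection latency, ~50 ms for layout. Thresholds are from the
# article; the function is an illustrative helper, not a Cadence tool.

def thin_client_ok(latency_ms, task="general"):
    """Apply the rule-of-thumb latency budget for remote EDA sessions."""
    budget_ms = 50 if task == "layout" else 100
    return latency_ms < budget_ms

print(thin_client_ok(80))                  # True: fine for most work
print(thin_client_ok(80, task="layout"))   # False: layout needs < 50 ms
```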
Vincent then switches the topic to storage. Most large EDA customers have switched to NetApp storage, he says. Smaller customers try to save money, says Vincent, even to the point of buying consumer-grade storage at the local Best Buy. After all, it’s so cheap to buy that way. You get the clear impression in Vincent’s tone of voice that this is a “very bad idea.” Consumer-grade I/O for storage devices cannot provide the bandwidth required by EDA tasks, he explains. There might be as much as a 20x performance difference. So indeed, it is a bad idea.
Storage devices also provide opportunity for misconfiguration. Vincent cites the case of a storage system where the client company attempted to aggregate ports to a storage device to boost bandwidth. However, the aggregation wasn’t done correctly and instead of doubling the bandwidth by pairing the ports, the configuration was actually set up to alternate between the ports, adding needless delay because each packet transmission had to first find the right port to use.
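The difference between the intended and the actual configuration in that story is essentially a link-selection policy. A hash-based policy (how LACP-style aggregation typically distributes traffic) pins each flow to one link so its packets stay in order, while a per-packet alternating policy sprays a single flow across both links. The sketch below illustrates the contrast; the flow identifier and two-link setup are assumptions for illustration.

```python
# Sketch: why the aggregation mode matters. Hash-based selection pins each
# flow to one link (typical of LACP-style aggregation); per-packet
# alternation sprays one flow across both links, as in the misconfigured
# setup described above. The flow ID below is made up for illustration.

from itertools import count

def hash_select(flow_id, num_links=2):
    """Pin every packet of a flow to the same link via a hash of the flow."""
    return hash(flow_id) % num_links

def round_robin_select(counter, num_links=2):
    """Alternate links on every packet, regardless of flow."""
    return next(counter) % num_links

flow = "nfs-client-10.0.0.5:2049"
print({hash_select(flow) for _ in range(6)})        # one link only
rr = count()
print({round_robin_select(rr) for _ in range(6)})   # both links: {0, 1}
```

With hash-based selection, six packets of the same flow all land on one link; with alternation they bounce between links, which is where the extra per-packet overhead and reordering come from.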
Then Vincent switches to Linux. “If you’re on Red Hat 4, switch to version 5 immediately,” he says. There are too many bugs to patch in version 4 at this point, and many of those bugs are gone in version 5. Beyond that simple fix, Vincent has seen other types of Linux problems related to imaging consistency and root access. To start with, buy consistent hardware, he says. You don’t have to buy the latest and greatest servers, but your hardware should be consistent. Otherwise, the Linux running on each server will need to be a little different. That will cause headaches.
“Consistency is the key,” he says. Inconsistent hardware requires hand patching, which is never consistent. Use a kickstart server to deploy identical Linux images to all servers. That’s the only path to consistency, says Vincent.
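One way to verify that kickstarted servers have actually stayed identical is to diff their package manifests. The host names and package lists below are made up for illustration; in practice the manifests would come from something like `rpm -qa` on each host.

```python
# Sketch: detect Linux image drift across servers by diffing package
# manifests against a reference host. Host names and package lists below
# are made up; real manifests would come from e.g. `rpm -qa` per host.

def find_drift(manifests):
    """Return {host: packages differing from the reference host's image}."""
    hosts = sorted(manifests)
    reference = manifests[hosts[0]]
    drift = {}
    for host in hosts[1:]:
        delta = manifests[host] ^ reference   # symmetric difference
        if delta:
            drift[host] = sorted(delta)
    return drift

manifests = {
    "sim01": {"kernel-2.6.18-194", "glibc-2.5-49", "nfs-utils-1.0.9"},
    "sim02": {"kernel-2.6.18-194", "glibc-2.5-49", "nfs-utils-1.0.9"},
    "sim03": {"kernel-2.6.18-164", "glibc-2.5-49", "nfs-utils-1.0.9"},  # hand-patched
}
print(find_drift(manifests))
```

A nightly run of a check like this catches the hand-patched stragglers that make “the Linux running on each server a little different.”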
Vincent concludes our discussion with a few more pointed suggestions:
- Right now, servers based on the Intel Westmere-EP core provide a significant performance boost for simulation (about a 30% increase). Vincent suspects it’s the particular mix of L1 and L2 processor cache that makes the difference.
- Stay two to three firmware releases back on network and storage equipment unless there’s a known bug you need to fix with a firmware upgrade. Introducing firmware patches as soon as they become available puts your EDA infrastructure on the bleeding edge. That’s probably not a good place for EDA infrastructure.
- Schedule two infrastructure maintenance windows per year (or at least one) to avoid unscheduled maintenance, also known as infrastructure crashes. It’s amazing how regular firmware upgrades and equipment reboots can clean out the cobwebs on a variety of infrastructure equipment. Memory leaks have not become extinct. Firmware needs to be upgraded. Deal with it.
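The firmware advice in the list above amounts to a simple version-selection policy: lag the latest release by two or three unless a known bug forces your hand. A minimal sketch, with made-up version strings:

```python
# Sketch of the "stay a few releases back" firmware policy from the list
# above: pick a version `lag` releases behind the latest (newest last),
# unless a known bug fix requires a newer one. Versions are illustrative.

def pick_firmware(releases, lag=2, must_include_fix_for=None):
    """Choose a conservative firmware release, `lag` versions behind latest."""
    candidate = releases[max(0, len(releases) - 1 - lag)]
    if must_include_fix_for and must_include_fix_for in releases:
        # A release carrying a needed bug fix outranks the lag policy.
        fix_idx = releases.index(must_include_fix_for)
        candidate = releases[max(fix_idx, releases.index(candidate))]
    return candidate

history = ["7.3.1", "7.3.2", "7.3.3", "7.3.5", "7.3.6"]
print(pick_firmware(history))                                # 7.3.3
print(pick_firmware(history, must_include_fix_for="7.3.5"))  # 7.3.5
```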
Vincent is a senior member of the worldwide Cadence EDA Infrastructure Acceleration Services team managed by Paul Rose. He and the team help customers get more performance out of their EDA infrastructures. The team has done “hundreds” of “best practices shares,” which are 4-hour meetings with IT and CAD teams at EDA clients to discuss infrastructure-based EDA performance enhancements in general terms. He and the team have performed more than 50 assessments of EDA infrastructure systems to fix problems and boost performance. “We do this stuff all the time,” says Vincent.
Interesting. I’ll keep an eye on this stuff from now on.
So those incredibly expensive NetApp boxes really are worth it, eh?
Can’t speak to the topic personally, John, but Peter Vincent told me that the large EDA users were almost all on board with NetApp.
I’m in the middle of a project to make EDA tools run faster. Virtuoso would run slow intermittently, which complicated the diagnosis. Each group looked at the system it was responsible for (e.g. Linux, license server, storage, the network), but in isolation, and everything looked to be operating normally.
It quickly became obvious that a more holistic systems approach was needed to address the intermittent performance issue and to measure the designer experience across all the respective systems. This involved creating Cadence replay files that replicated the designer’s experience, i.e. VXL, ADE, layout, simulation, waveform viewing… Cadence sessions were then scheduled every hour for a few days and hard data collected. This generated nice reports and graphs, but we still could not identify the cause of the problem the designers were experiencing.
Phase II. Overload the system but do no harm
This was done by scheduling the Cadence sessions in the middle of the night for a few hours over a few days. The number of Cadence sessions exceeded the number of licenses, and multiple sessions were scheduled simultaneously. More hard data. Even nicer-looking graphs. Finally, we were able to narrow it down to one of the file servers.
That’s an interesting approach to root cause analysis and I like the idea of pushing the environment to the limit to see where the limitations are. We typically use a combination of process tracking and monitoring to identify where any bottlenecks exist. I’d be happy to share best practices with you and your team if that’s of interest.