Power Attainment Engineer - Data Center GPU
Advanced Micro Devices
- Markham, ON
- Permanent
- Full-time
- Execute Power Attainment test plans in post-silicon phases in support of the Data Center GPU product roadmap, optimizing for power, perf/watt and performance.
- Configure and set up ML/AI Datacenter GPU systems for data collection, experiments and test plan execution
- Use lab equipment such as oscilloscopes, high-speed probes, function generators and data acquisition equipment to gather the electrical characterization data required for power and performance optimization.
- Actively participate in analysis of collected post-silicon performance and power data to ensure integrity of results, summarize results and conclusions, and drive productization of features
- Analyze and debug interactions between various power management features
- Analyze data from workload or execution output datalogs using Excel or JMP, either manually or through developed automation
- Execute ROI analysis of power management features and provide feedback to power management architecture team.
- Support prototyping experiments and development of new GPU features that impact performance and power
- Electrically stress the system, validate the limits of ASIC and system/board components and optimize settings for stability and performance.
- Troubleshoot system-level issues that may occur in test environments and platforms
- Proactively drive continuous improvement for post-silicon power attainment activities
- Participate in developing the automation environment: write scripts that automate workloads and enhance execution capabilities in Linux, Python and other supporting software tools
- Work in a fast-paced, resource-constrained environment to build top-of-the-line HPC & AI GPU products
- Provide technical leadership for electrical validation and power optimization in datacenter platforms.
- Contribute to team building; develop and mentor junior engineers into the technical leads of the future
- Drive process efficiencies, automation and AI for debug and analysis.
- Provide weekly readouts to executives on progress, blockers and next steps.
- Work with Rack & Cluster teams to develop and execute the E2E electrical validation test plan and build electrically robust, reliable, stable and performant systems.
- Debug customer issues; collaborate with L1 and L2 support and customers to design DOEs that isolate the problem, and provide a fix.
- 7 years of hands-on experience as an engineer in the semiconductor industry.
- Demonstrated ability to execute and deliver multiple projects in a timely fashion.
- Ability to prioritize work items in a fast-paced environment and escalate as necessary.
- Excellent grasp of computer organization/architecture, GPU architecture and power management
- Knowledge of power-limited performance methodologies and control theory
- Extensive experience in platform optimization. Solid knowledge of computer I/O.
- Experience with tools for power and performance analysis
- Strong programming skills, scripting experience in Python preferred
- Familiarity with HPC/AI applications and benchmarks would be a big plus.
- Proficiency in the Linux command-line environment and shell scripting is desirable
- Deep knowledge of power management techniques such as deep sleep, clock gating, P-states, etc.
- Experience with container technologies (e.g., Docker)
- Strong analytical and problem-solving skills with keen attention to detail
- Experience in data analysis, summarization, and presentation
- Excellent presentation and communication skills
- Experience using and debugging lab tools such as oscilloscopes, DAQs and power measurement equipment
- Experience working in Windows and Linux environments
- Experience working in data center environments, with knowledge of boards, systems, racks and clusters, and of building large, electrically stable systems.