MIIS - Advanced Translation Technology Portfolio

May 15, 2026 · ArnoX

This mini-portfolio highlights selected work from my Advanced Translation Technology course at the Middlebury Institute of International Studies at Monterey. The projects below reflect my hands-on experience with machine translation training, translation technology evaluation, data preparation, post-editing workflows, and translation management systems.

Rather than treating translation technology as a black box, these projects helped me understand how language data, domain fit, quality evaluation, workflow design, and human review all work together in a real localization environment. My focus throughout the course was practical: how to use translation technology responsibly, how to measure whether it is actually useful, and how to design workflows that improve quality, productivity, and cost efficiency at the same time.

This portfolio includes two major projects:

  1. A custom machine translation training project using Microsoft Azure Custom Translator
  2. A translation management system evaluation and selection case study comparing XTM and WXRKS

Together, these projects show how I approach translation technology from both a technical and operational perspective: building, testing, evaluating, and recommending solutions that can support real localization teams at scale.



Project 1: Custom MT Training with Microsoft Azure Custom Translator

Project type: Machine Translation training and evaluation
Language pair: Chinese to English
Domain: Chinese government work reports and official documentation
Platform: Microsoft Azure Custom Translator
Training dataset: Azure MT Training Dataset (Google Drive folder)

Project Overview

For this project, my team built and evaluated a custom-trained machine translation engine using Microsoft Azure Custom Translator. The goal was to test whether a domain-specific MT engine could support the translation of Chinese government work documents into English in a way that was faster and more cost-effective than fully human translation, while still meeting the quality expectations required for official and politically sensitive content.

The project was designed as a pilot, not just a technical experiment. We wanted to answer a practical business question: Can a custom MT engine produce output that is stable enough for post-editing and review in a real government-document translation workflow?

To evaluate that, we looked at three major dimensions: quality, productivity, and cost.

Workflow

The project started with data preparation. We collected Chinese-English government-related texts, extracted usable content, aligned source and target segments, and cleaned the training data before uploading it into Azure Custom Translator. The preparation process used tools such as Trados Studio, memoQ, Olifant, and VS Code, along with AI-assisted cleanup methods.
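As an illustration of that cleanup stage, the short Python sketch below filters a bilingual Chinese-English TSV before upload. The file names and filtering thresholds are hypothetical examples, not the exact scripts we used in the project:

```python
# Minimal sketch of pre-upload dataset cleanup for Azure Custom Translator.
# File names and thresholds are illustrative placeholders.
import csv

def clean_parallel_tsv(in_path: str, out_path: str,
                       min_len: int = 2, max_ratio: float = 3.0) -> int:
    """Drop empty, duplicate, or badly mismatched ZH-EN segment pairs."""
    seen = set()
    kept = 0
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8", newline="") as fout:
        reader = csv.reader(fin, delimiter="\t")
        writer = csv.writer(fout, delimiter="\t")
        for row in reader:
            if len(row) < 2:
                continue
            zh, en = row[0].strip(), row[1].strip()
            # Skip empty or near-empty segments.
            if len(zh) < min_len or len(en.split()) < min_len:
                continue
            # Skip pairs whose length ratio suggests misalignment.
            if len(en) / max(len(zh), 1) > max_ratio:
                continue
            # Skip exact duplicate pairs.
            if (zh, en) in seen:
                continue
            seen.add((zh, en))
            writer.writerow([zh, en])
            kept += 1
    return kept

if __name__ == "__main__":
    print(clean_parallel_tsv("gov_reports_raw.tsv", "gov_reports_clean.tsv"))
```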

After preparing the training, tuning, and testing datasets, we completed multiple rounds of MT training. Across the pilot, we ran six training rounds in Azure Custom Translator and tracked BLEU score changes after each round. The BLEU score rose from 25.7 to 27.7, with later rounds plateauing in the high-27 range. While the improvement was moderate, the results gave us a useful view into how training data quality, alignment consistency, and domain relevance affected MT performance.
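Azure Custom Translator reports BLEU on its held-out test set automatically. For quick spot-checks outside the portal, a minimal sketch using the sacrebleu library (file names are placeholders; one segment per line, same order in both files) could look like this:

```python
# Minimal BLEU spot-check outside the Azure portal, using sacrebleu.
import sacrebleu

with open("mt_output.en", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]
with open("reference.en", encoding="utf-8") as f:
    references = [line.strip() for line in f]

# corpus_bleu takes a list of hypothesis strings and a list of
# reference lists (one inner list per reference set).
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.1f}")
```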

The team also planned additional training rounds, with a target BLEU range of 35–50 for a more production-ready engine. This helped frame the pilot as an early-stage feasibility test rather than a final production deployment.

Quality Evaluation

Because the content involved government documents, quality could not be evaluated only by automatic metrics. BLEU scores were useful for tracking relative changes across training rounds, but they were not enough to determine whether the output was acceptable for official use.

To address this, we designed human evaluation criteria based on practical risk. The expected post-edited output needed to avoid critical mistranslations, politically risky wording, factual inaccuracies, and terminology errors that could affect policy meaning. Minor style issues were acceptable, but critical meaning errors were not.

We also used an MQM-style quality framework to define pass, borderline, fail, and automatic fail scenarios. For example, a strong pass would allow a limited number of minor issues and no critical errors, while any critical issue would trigger an automatic fail. This helped connect quality evaluation to real-world decision-making instead of relying only on a single score.
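As a rough illustration, that decision logic can be expressed as a small function. The error categories and thresholds below are simplified placeholders, not the project's exact rubric:

```python
# Illustrative sketch of the MQM-style pass/borderline/fail logic.
# Thresholds are simplified examples, not the actual project rubric.
def rate_sample(critical: int, major: int, minor: int,
                minor_pass_limit: int = 5, major_pass_limit: int = 1) -> str:
    """Map error counts for a reviewed sample to a quality verdict."""
    if critical > 0:
        return "automatic fail"   # any critical meaning or politically risky error
    if major <= major_pass_limit and minor <= minor_pass_limit:
        return "pass"             # limited minor issues, no critical errors
    if major <= major_pass_limit + 1:
        return "borderline"       # needs targeted review before release
    return "fail"

print(rate_sample(critical=0, major=1, minor=3))  # -> pass
print(rate_sample(critical=1, major=0, minor=0))  # -> automatic fail
```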

Productivity and Cost Analysis

One of the most useful parts of this project was comparing post-edited machine translation with fully human translation.

For a 2,000-word sample, the pilot estimated that post-editing took about 4.69 hours, while fully human translation was estimated at around 15.5 hours. Based on this comparison, PEMT was about 70% faster than human translation in the pilot setting.

The cost comparison showed a similar pattern. The estimated PEMT cost was $253.80, including post-editing and review. The estimated human translation cost for the same volume was $780, including translation and review. Based on these numbers, PEMT was estimated to be about 67% cheaper for the pilot sample.
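As a sanity check, the percentages above follow directly from the pilot estimates:

```python
# Worked check of the pilot's productivity and cost comparisons.
pemt_hours, ht_hours = 4.69, 15.5
pemt_cost, ht_cost = 253.80, 780.00

time_saving = 1 - pemt_hours / ht_hours   # ~0.70 -> about 70% faster
cost_saving = 1 - pemt_cost / ht_cost     # ~0.67 -> about 67% cheaper

print(f"Time saving: {time_saving:.0%}")
print(f"Cost saving: {cost_saving:.0%}")
```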

At the same time, we treated these numbers carefully. The pilot showed strong productivity and cost potential, but a real production workflow would still need to account for training setup, data cleaning, quality evaluation, engine maintenance, and human review.

Key Takeaways

This project helped me understand that MT quality is not only about the engine. It depends heavily on the quality of the training data, the consistency of the domain, the clarity of evaluation criteria, and the design of the human review workflow.

A few lessons stood out to me.

First, data preparation matters as much as model training. Poorly aligned or noisy segments can limit MT performance even when the platform itself is strong.

Second, automatic metrics are useful, but they need to be paired with human evaluation. For sensitive content, a higher BLEU score does not automatically mean the output is safe to publish.

Third, PEMT can create meaningful productivity and cost gains, but only when the output is stable enough to support efficient review. If the MT output requires heavy rewriting, the workflow can quickly lose its advantage.

Finally, this project gave me a more realistic view of how AI translation technology should be implemented in production. The goal is not to replace human judgment, but to build a workflow where technology handles repeatable translation work more efficiently and humans focus on accuracy, nuance, risk, and final quality.

Skills Demonstrated

Through this project, I practiced:

  • Custom MT training with Microsoft Azure Custom Translator
  • Chinese-English government-domain data preparation
  • Source-target alignment and dataset cleanup
  • BLEU score tracking and interpretation
  • PEMT productivity and cost analysis
  • MQM-style human quality evaluation
  • Translation technology workflow design
  • Risk-aware evaluation for official and sensitive content

This project strengthened my understanding of how AI translation systems can be evaluated and operationalized in a real localization environment. It also reinforced a key principle I would bring to future localization and translation technology work: AI is most valuable when it is connected to clear workflows, measurable outcomes, and responsible human review.



Project 2: TMS Evaluation and Selection Case Study

Project type: Translation Management System evaluation and selection
Scenario: ByteDance-style high-volume localization operations
Systems compared: XTM vs. WXRKS
Presentation video: TMS Evaluation and Selection Final Presentation

Project Overview

For the second project, my team designed a TMS evaluation and selection case study for a large-scale, fast-moving localization environment. The scenario was built around a ByteDance-style organization with large volumes of multilingual content, current and future localization needs, and multiple stakeholder groups involved in daily localization operations.

The main business question was: Should the organization continue using XTM, a mature enterprise TMS, or adopt WXRKS, a more modern workflow-oriented platform, for future localization operations?

Instead of comparing the two systems only by feature lists, we built an evaluation framework based on real business requirements, stakeholder pain points, pilot testing, and weighted scoring. The goal was to make a recommendation that reflected actual localization operations, not just tool preference.

Evaluation Methodology

We structured the evaluation around five steps:

  1. Identify key business requirements by stakeholder
  2. Build a weighted scorecard
  3. Define a pilot project
  4. Score XTM and WXRKS
  5. Present a final recommendation

This structure helped us move from a subjective tool comparison to a more business-driven decision model. It also made the evaluation easier to explain to both technical and non-technical stakeholders.

The project considered the needs of six major stakeholder groups: Localization Program Managers, Localization Operations/PMs, Linguists/Reviewers/Vendors, Product and Content Teams, Engineering/System Teams, and Leadership/Procurement/Finance. Each group had different concerns, including workflow design, project setup, editor usability, launch readiness, integration capability, reporting visibility, cost, and long-term business value.

Business Requirements

From the stakeholder analysis, we identified several key business requirements that a future-ready TMS should support:

  • Workflow flexibility for different project types and approval paths
  • Easy project setup for daily localization operations
  • Scalability for high-volume multilingual work
  • Strong editor usability for linguists and reviewers
  • Collaboration support for comments, issues, and decisions
  • Fast turnaround time for product and content launches
  • QA and consistency controls to protect brand voice and user experience
  • Integration capability to reduce manual handoff and system friction
  • Reporting and visibility for status tracking and performance management
  • Reasonable total cost of ownership

This part of the project was especially useful because it showed that TMS selection is not only a localization tooling decision. It is also an operations, engineering, finance, and stakeholder-management decision.

Weighted Scorecard Design

To make the comparison more objective, we created a weighted scorecard. Each TMS was scored on a 1–5 scale, where 1 represented poor performance and 5 represented excellent support for the requirement. The final weighted score was calculated by multiplying the raw score by the requirement weight.

The category-level weights reflected business priorities. Workflow and project management carried the highest weight at 20%, followed by scalability and performance, linguist and reviewer experience, quality and consistency, and integration and technical fit. Reporting and visibility, as well as cost and implementation risk, were also included in the final evaluation.
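A minimal sketch of how the weighted totals were computed is shown below. Only the 20% weight for workflow and project management comes from the actual scorecard; the remaining weights and the raw scores are illustrative placeholders:

```python
# Minimal sketch of the weighted scorecard calculation.
# Only the 20% workflow weight is from the real scorecard; the other
# weights and the example ratings are placeholders.
weights = {
    "Workflow & project management": 0.20,
    "Scalability & performance": 0.15,
    "Linguist & reviewer experience": 0.15,
    "Quality & consistency": 0.15,
    "Integration & technical fit": 0.15,
    "Reporting & visibility": 0.10,
    "Cost & implementation risk": 0.10,
}

def weighted_total(raw_scores):
    """Sum of raw score (1-5) x category weight for one TMS."""
    return sum(raw_scores[c] * w for c, w in weights.items())

example_scores = {c: 4 for c in weights}   # placeholder ratings
print(f"Weighted total: {weighted_total(example_scores):.2f} / 5.00")
```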

This scorecard approach helped translate workflow pain points into measurable decision criteria. It also made the recommendation more transparent because stakeholders could see not only which system scored higher, but also why certain areas mattered more than others.

Pilot Project Design

To test the systems in a realistic workflow, we designed a pilot around a product launch localization package. The pilot included a marketing page, UI strings, and a Help Center article, with English as the source language, 3–5 target locales, and an estimated scope of 2,000–5,000 words.

The pilot participants included one PM, one reviewer, one engineering observer, and one to two linguists or simulated vendors. The tasks tested in both systems included project creation and setup, user permissions, file ingestion, translation editor usability, translation memory and terminology support, review and comment workflow, QA checks, status tracking, export/delivery, and API or integration feasibility.

This pilot design was important because it tested the TMS where it actually matters: daily operations. Instead of asking whether a platform technically had a feature, we looked at whether that feature supported a smoother and more scalable localization workflow.

Evidence Collected

The evaluation was based on multiple types of evidence, including task completion observations, time-to-complete comparisons, friction points and blockers, PM and linguist feedback, reviewer and engineering feedback, QA effectiveness, workflow clarity, scorecard ratings, and screenshots for visual comparison.

This made the evaluation more practical and defensible. A TMS can look strong in a demo, but the real test is whether PMs can set up projects quickly, linguists can work efficiently, reviewers can leave clear feedback, engineers can support integrations, and leadership can get enough visibility into project status and operational risk.

Comparative Findings

The comparison showed a clear trade-off between XTM and WXRKS.

XTM was positioned as a mature enterprise platform with broad feature coverage, proven standard workflows, and familiarity for current users. However, it could feel heavy and complex in fast-moving daily operations.

WXRKS was positioned as a more modern workflow experience with stronger daily usability, better collaboration flow, and a better fit for fast-moving content. The main risk was migration and onboarding, especially if the organization needed to move from an established system to a newer platform.

This comparison helped us avoid an overly simplistic recommendation. XTM had strong enterprise credibility, while WXRKS appeared better aligned with the future-state workflow we were designing.

Final Recommendation

Our final recommendation was to adopt WXRKS as the preferred TMS for future localization operations, using a phased rollout instead of a full immediate migration.

The reasoning was that WXRKS offered stronger workflow flexibility, better PM and linguist usability, clearer collaboration and review flows, better operational visibility, and a stronger fit for high-speed, high-volume localization work.

At the same time, we recommended risk controls for implementation. The rollout should begin with one product area, train PMs, reviewers, and linguists, validate integrations before full migration, and expand gradually after pilot success.


Key Takeaways

This project helped me understand how to evaluate translation technology from a business and operations perspective. A good TMS is not just the one with the longest feature list. It is the one that best supports the organization’s actual workflows, stakeholder needs, scalability requirements, and long-term localization strategy.

The biggest takeaway for me was that TMS selection should be evidence-based. Strong evaluation requires stakeholder analysis, weighted requirements, realistic pilot tasks, and a clear understanding of implementation risk. It also requires balancing short-term operational familiarity with long-term workflow improvement.

This project also strengthened my ability to communicate technical and operational recommendations to different audiences. For localization teams, the focus is daily usability and quality. For engineering, the focus is integration and maintainability. For leadership and procurement, the focus is cost, risk, and long-term business value. A strong recommendation needs to connect all of these perspectives.

Skills Demonstrated

Through this project, I practiced:

  • TMS evaluation and selection
  • Stakeholder requirement analysis
  • Weighted scorecard design
  • Localization workflow assessment
  • Pilot project planning
  • Tool comparison across business, linguistic, technical, and financial criteria
  • Risk-aware implementation planning
  • Executive-style recommendation writing
  • Localization operations strategy

This project reinforced one of the most important lessons from the course: translation technology decisions should be driven by workflow reality, not tool demos. A system only creates value when it helps real teams move faster, collaborate better, reduce friction, and maintain quality at scale.

Reflection

Across both projects, this course gave me a more complete view of translation technology as a real-world localization function. The Azure MT project focused on how to train, evaluate, and operationalize a custom machine translation engine. The TMS evaluation project focused on how to assess localization platforms from a stakeholder, workflow, and business-value perspective.

Together, they helped me connect technical experimentation with practical implementation.

The biggest lesson I took away is that translation technology is never only about the tool itself. A strong MT engine still needs clean data, clear evaluation standards, and responsible human review. A strong TMS still needs well-defined workflows, stakeholder alignment, implementation planning, and measurable success criteria.

In future localization and translation technology work, I would bring this same mindset: start with the business problem, understand the workflow, define what quality means, measure performance with the right evidence, and recommend solutions that are scalable, practical, and responsible.

These projects also align closely with the type of work I want to keep doing: localization program management, translation technology operations, AI-assisted workflow design, quality evaluation, and cross-functional localization strategy. My goal is to help global teams use technology not just to move faster, but to build localization systems that are easier to manage, easier to measure, and better for users across languages and markets.