*The following is an overview of some of the significant data science work that goes into transforming unstructured employee profile data into the more usable and accessible format published by Revelio Labs. This piece was written by one of our data scientists, Max Rabinovich.*

```
import numpy as np, pandas as pd
from matplotlib import pyplot as plt
import polyglot
from polyglot.detect import Detector as PolyglotDetector
from polyglot.text import Text as PolyglotText
plt.rc('text', usetex=False)
plt.rc('font', family='serif')
data = pd.read_pickle("/home/rabinovich/workspace/dat/data0/skill_language_blog_data.pkl")
# Expose each entry of the pickled dict as a module-level variable.
for name, value in data.items():
    globals()[name] = value
```

An important part of our focus at Revelio Labs is on analyzing professional skills. Think (in the data science world): Python, pandas, Bayesian machine learning, convolutional neural networks, etc. We really do have coverage of skills at that level of specificity – across the full range of occupations.

With such broad coverage, we shouldn’t be surprised to find that the raw data needs some serious cleaning. In this post, we’re gonna talk about how we address one particularly significant data cleaning problem – separating out skills expressed in English from those expressed in other languages.

The material we have to work with is counts of skill occurrences in online professional profiles, both limited to the US and globally. Counts represent the number of individuals who listed the skill in their profile, and the total number of distinct skills is very large (~2.8M) because there’s a huge amount of variety in the way people report their skills. (Not to mention misspellings, hobbies, and random phrases….) For this analysis, we’ll limit ourselves to the ones that occur most frequently, the top 30K or so.

Our first idea is that English-language skills should occur disproportionately within the US data. In other words, if we look at the ratio of US counts of a skill to global counts of that skill (which we call the US-to-global-ratio, or USGR for short), we expect that the skills with the highest ratios will be expressed in English and vice versa. That doesn’t actually *have* to be true; for instance, some English skills might be limited to occupations that are more common outside the US. Also, to the extent that other languages (like Spanish) are commonly spoken in the US, this method might tend to classify skills expressed in those languages as English. Still, it seems like a good first thing to try.
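As a rough illustration of the ratio itself, here is a minimal sketch. The column names and the toy counts are made up for the example and are not taken from our data.

```python
import pandas as pd

# Hypothetical toy counts; the column names are assumptions for this sketch.
counts = pd.DataFrame(
    {
        "skill": ["python", "project management", "ventas", "vertrieb"],
        "us_count": [900, 5000, 40, 2],
        "global_count": [3000, 20000, 4000, 1000],
    }
)

# USGR = (people in the US listing the skill) / (people worldwide listing it).
counts["usgr"] = counts["us_count"] / counts["global_count"]

# Rank 0 is the lowest ratio, i.e. the skill least concentrated in the US.
ranked = counts.sort_values("usgr").reset_index(drop=True)
print(ranked[["skill", "usgr"]])
```

In this toy table the two non-English skills land at the bottom of the ranking, which is the pattern the idea relies on.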

To explore how well this idea works, we tried several US-to-global-ratio cutoffs. We ranked the skills by their US-to-global-ratio (in increasing order) and then, for each of a reasonable set of thresholds (0.01, 0.03, 0.05, 0.07), found the point in the ranking where the ratio crosses that threshold. That’s the point at which we would cut the skills: a skill that occurs *before* the cutoff would be excluded, and one that occurs after it would be kept.

One way to think about this is that we’re trying to classify skills using a single feature – in which case choosing a threshold for that feature is really all we can do!
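Turning a ratio threshold into a rank cutoff can be sketched as below. The helper name `rank_cutoff` and the sample ratios are hypothetical; the only real requirement is that the ratios are already sorted in increasing order.

```python
import numpy as np

def rank_cutoff(sorted_ratios, threshold):
    """Return the first rank whose ratio is >= threshold.

    `sorted_ratios` must be in increasing order. Skills at ranks below the
    returned value would be excluded; the rest would be kept.
    """
    return int(np.searchsorted(sorted_ratios, threshold, side="left"))

# Hypothetical ratios, already sorted in increasing order.
sorted_ratios = np.array([0.001, 0.004, 0.02, 0.04, 0.06, 0.08, 0.3, 0.5])

for threshold in (0.01, 0.03, 0.05, 0.07):
    cut = rank_cutoff(sorted_ratios, threshold)
    kept = len(sorted_ratios) - cut
    print(f"threshold={threshold}: cut at rank {cut}, keep {kept} skills")
```

Because the ranking is sorted, a binary search is enough; a higher threshold can only move the cut to the same or a later rank.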

Here’s a sampling of the first 50 skills after the cutoff in each case (column names refer to where the cut in the ranking is made):

`kept_skills_by_us_cutoff[::5]`

The results in this table give some idea of what’s going on. It looks like the concentration of non-English terms is higher at the low end of the US-to-global-ratio ranking, which is good news for our classification idea. It also looks like the right cutoff might be somewhere around 4000: based on this (small) table, ~3900 is too low, while ~4300 seems high enough.

Probably we could manually do a binary search in that range, or a slightly larger one, and come up with a pretty decent threshold. But is there a way to choose a threshold that’s more strongly supported by the data? After all, manual tuning depends on eyeballing, and eyeballing is prone to all sorts of human errors.

It turns out that the answer is yes. To explain that answer, though, we need to take a detour through a different approach to classifying skills into languages.

We use the polyglot Python library, which uses a naive Bayes model to classify strings by language. Technically, the model also provides probability estimates. Since these are concentrated near 0 and 1, though, we feel comfortable just using the classifications.

Given such a model, it’s tempting to think we can outsource our entire problem to it. But that doesn’t work, in part because the classifier misses a decent number of skills that score well on the US-to-global-ratio and are, in fact, probably English, and in part because of the opposite problem: it labels as English a number of skills that score poorly on the US-to-global-ratio.

```
ratios_with_language[["skill", "language", "ratio"]][ratios_with_language.language != "English"]\
.sort_values(by="ratio", ascending=False).head(250).iloc[::25]
```

```
ratios_with_language[["skill", "language", "ratio"]][ratios_with_language.language == "English"]\
.sort_values(by="ratio").head(250).iloc[::25]
```

So how can we use the language classifier to *improve* our classification based on US-to-global-ratio? Specifically, can the language classification point the way to the correct threshold to use?

Well, we gave away the answer before: yes. If we believe the premise that low US-to-global-ratio indicates a higher probability of being non-English (and vice versa), then we should expect that non-English skills will cluster low in the US-to-global-ratio ranking. And although the polyglot classifier isn’t perfect, we might be able to detect a similar tendency for polyglot to classify low US-to-global-ratio skills as non-English.

To see if that’s true, we plot the fraction of non-English-classified skills in a window around each rank against the US-to-global-ratio rank. (We call this the “local non-English fraction” for precision.) The results are striking.
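For reference, the local non-English fraction can be computed roughly as follows. This is a sketch under assumptions: it treats the window as centered with `2 * radius + 1` ranks, and it runs on synthetic labels rather than the real classifier output. The function name `local_fraction` is hypothetical.

```python
import numpy as np
import pandas as pd

def local_fraction(is_non_english, radius):
    """Centered moving average of a 0/1 indicator over 2 * radius + 1 ranks."""
    indicator = pd.Series(is_non_english, dtype=float)
    window = 2 * radius + 1
    # min_periods=1 lets the window shrink at the two ends of the ranking.
    return indicator.rolling(window, center=True, min_periods=1).mean()

# Synthetic labels in rank order: mostly non-English first, then mostly English.
rng = np.random.default_rng(0)
labels = np.concatenate([rng.random(400) < 0.65, rng.random(600) < 0.15])

smoothed = local_fraction(labels, radius=50)
print(smoothed.iloc[:300].mean(), smoothed.iloc[500:].mean())
```

A larger radius gives a smoother curve but blurs the location of any transition.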

```
plt.clf()
ax1 = plt.subplot(1, 3, 1)
ax1.plot(np.arange(ratios_with_language.shape[0]),
moving_average_width5)
#ax1.set_aspect(aspect=1.0/ax1.get_data_ratio())
ax1.figure.set_size_inches(16, 4)
ax1.set_title("radius = 5")
ax1.set_xlabel("Rank by USGR")
ax1.set_ylabel("Local Non-English Fraction")
ax1.set_ylim((0.0, 1.0))
ax2 = plt.subplot(1, 3, 2)
#plt.show()
ax2.plot(np.arange(ratios_with_language.shape[0]),
moving_average_width25)
#ax2.set_aspect(aspect=1.0/ax2.get_data_ratio())
ax2.figure.set_size_inches(16, 4)
ax2.set_title("radius = 25")
ax2.set_xlabel("Rank by USGR")
ax2.set_ylabel("Local Non-English Fraction")
ax2.set_ylim((0.0, 1.0))
ax3 = plt.subplot(1, 3, 3)
ax3.plot(np.arange(ratios_with_language.shape[0]),
moving_average_width50)
ax3.figure.set_size_inches(16, 4)
ax3.set_title("radius = 50")
ax3.set_ylabel("Local Non-English Fraction")
ax3.set_xlabel("Rank by USGR")
ax3.set_ylim((0.0, 1.0))
plt.show()
```

`moving_average_width5[:3000].mean(), moving_average_width5[5000:].mean()`

(0.6445386844636845, 0.1433632077797458)

```
print(moving_average_width5[:3000].std(), moving_average_width5[5000:].std())
print(moving_average_width50[:3000].std(), moving_average_width50[5000:].std())
print(ordered_fractions_not_english[:3000].std(), ordered_fractions_not_english[5000:].std())
```

0.1477115778612066 0.11023579704168593

0.05143211369716867 0.04923196662727674

0.03918807506026525 0.0423962747772698

Two points worth making about these plots.

First, even with pretty minimal smoothing, there’s a detectable shift in average that happens somewhere between rank < 3000 (0.64) and rank >= 5000 (0.14).

Second, once we widen the window to 50, the spread on each side of that divide is already quite small: a standard deviation of about 0.05 on both sides, compared to between 0.11 and 0.15 with a window of 5. Since every increase in window size limits our ability to find a precise transition, we settle for a window of 100 for our analysis.
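One way to sanity-check how much smoothing buys us is a back-of-the-envelope binomial estimate. This is a sketch that assumes the labels inside a window are independent 0/1 draws with a fixed rate `p`, and that a radius `r` means `2 * r + 1` ranks per window. Neither assumption is guaranteed to hold for the real data.

```python
import math

def bernoulli_window_std(p, radius):
    """Std of the mean of 2 * radius + 1 independent 0/1 labels with rate p."""
    n = 2 * radius + 1
    return math.sqrt(p * (1.0 - p) / n)

# Rates taken from the means reported above for rank < 3000 and rank >= 5000.
for p in (0.64, 0.14):
    for radius in (5, 50):
        print(f"p={p}, radius={radius}: {bernoulli_window_std(p, radius):.3f}")
```

Under these assumptions the estimates land in the same ballpark as the measured standard deviations. The noise shrinks only with the square root of the window size, so widening the window further gives diminishing returns.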

These are the results we see.

(Side note: Plots like these are irresistible to anyone who’s worked on statistical theory or statistical mechanics. Sharp nosedives point to phase transitions and phase transitions are an inherent good, as far as I can tell from the literature.)

```
plt.clf()
ax1 = plt.subplot(1, 2, 1)
ax1.plot(np.arange(ratios_with_language.shape[0]),
ordered_fractions_not_english)
ax1.set_title("radius = 100")
ax1.set_ylabel("Local Non-English Fraction")
ax1.set_xlabel("Rank by USGR")
ax1.set_ylim((0.0, 1.0))
ax1.figure.set_size_inches(12, 4)
ax2 = plt.subplot(1, 2, 2)
rank_range = np.arange(2000, 6000)
ax2.plot(np.arange(ratios_with_language.shape[0])[rank_range],
ordered_fractions_not_english.iloc[rank_range])
ax2.figure.set_size_inches(12, 4)
ax2.set_title("radius = 100")
ax2.set_xlabel("Rank by USGR")
ax2.set_ylabel("Local Non-English Fraction")
ax2.set_ylim((0.0, 1.0))
plt.show()
```

Looking at the plots, it seems like there’s a pretty sharp transition in the local non-English fraction somewhere between rank 3500 and rank 4500. That’s actually in line with our initial manual exploration of possible USGR cutoffs, which gives us some more confidence that the idea of USGR cutoffs is a good one to begin with, and that our manual exploration gave roughly the right answer.

The next step is to use the sharp transition in local non-English fraction to choose a rank cutoff. We try a few reasonable-seeming thresholds on the local non-English fraction: 0.2, 0.25, 0.3, 0.35. They lead to rank cutoffs between ~4000 and ~4500, so somewhat more aggressive than our manual tuning in discarding possibly non-English skills.
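The threshold-to-cutoff step might look like the sketch below. It assumes the cutoff is the first rank at which the smoothed fraction falls below the threshold, and it uses a synthetic curve in place of the real one. The helper name `first_rank_below` is hypothetical.

```python
import numpy as np

def first_rank_below(local_fraction, threshold):
    """Return the first rank where the fraction falls below threshold, or None."""
    below = np.asarray(local_fraction) < threshold
    return int(np.argmax(below)) if below.any() else None

# Synthetic smoothed curve: high plateau, sharp drop, low plateau.
ranks = np.arange(10_000)
curve = 0.15 + 0.5 / (1.0 + np.exp((ranks - 4000) / 150.0))

for threshold in (0.2, 0.25, 0.3, 0.35):
    print(threshold, first_rank_below(curve, threshold))
```

On a falling curve, a lower threshold is reached later, so it gives a higher rank cutoff and discards more skills.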

`kept_skills_by_lang_cutoff.iloc[::5]`

Looking at the results, it seems like the highest rank threshold here is the one to go with. It might not be optimal; we could conceivably go even higher and manually tune more. Since we’ve already dug pretty deep and it seems like we’ve caught a large chunk of the non-English skills, we’re choosing to leave well enough alone and take ~4500 as our cutoff.

In a sense, that’s the bottom line of our analysis. There remains one indisputable fact, however: playing with data is fun. So let’s indulge a little and backtrack to see if the sharp transition we found through the language classification method shows up in the USGR data alone.

The next plots show that the answer is basically yes. Somewhere between ranks 3500 and 5500 there’s definitely a transition: the USGR shifts from ~0.025 to ~0.2, a change of a full order of magnitude in just 1000 ranks, and by far the fastest change over the entire curve. Fortunately that transition point is roughly at the same point in the ranks as our previous method suggested, though (as we noted at the outset) the USGR points to excluding fewer skills.

```
ranks = np.arange(us_to_global_ratios.shape[0])
plt.clf()
ax1 = plt.subplot(1, 2, 1)
ax1.plot(ranks,
us_to_global_ratios.sort_values().values)
ax1.figure.set_size_inches(12, 4)
ax1.set_xlabel("Rank by USGR")
ax1.set_ylabel("USGR")
ax2 = plt.subplot(1, 2, 2)
rank_range = np.arange(3500, 5500)
ax2.plot(ranks[rank_range],
us_to_global_ratios.sort_values().values[rank_range])
ax2.figure.set_size_inches(12, 4)
ax2.set_xlabel("Rank by USGR")
ax2.set_ylabel("USGR")
plt.show()
```

A final thought here: the USGR seems to show a sharp transition not only in its value, but also in its rate of change (which is to say, the curve exhibits a kink). Could we detect that, too?

Since we’re trying to detect a sharp transition in the rate of change, we need to look at the differences between successive elements in the sequence. We can start by just taking the raw differences, which leads to the plot on the left below. We can say pretty confidently that something happens around rank 5000, even with the local fluctuations apparent in the plot. On the right side, we smooth out those fluctuations by averaging within a sliding window of radius 100.

```
end_offset = -2
plt.clf()
ax1 = plt.subplot(1, 2, 1)
ax1.plot(ranks[:end_offset],
us_to_global_ratios_diff[:(end_offset+1)])
ax1.figure.set_size_inches(12, 4)
ax1.set_xlabel("Rank by USGR")
ax1.set_ylabel("USGR First Difference")
ax1.set_title("Unsmoothed")
ax2 = plt.subplot(1, 2, 2)
ax2.plot(ranks[:end_offset],
us_to_global_ratios_diff_smooth[:(end_offset+1)])
ax2.figure.set_size_inches(12, 4)
ax2.set_xlabel("Rank by USGR")
ax2.set_ylabel("USGR First Difference")
ax2.set_title("Smoothed (radius = 100)")
plt.show()
```

Finally, we want to look at what rank cutoffs we’d get if we had used the USGR first differences as the basis of detecting the boundary between mostly-non-English and mostly-English skills. For this, we look for the ranks at which the smoothed USGR first difference first crosses a set of thresholds between 0.00001 and 0.00014 (namely, 0.00001, 0.00004, 0.00007, 0.00008, 0.00011, and 0.00014).
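The first-difference approach can be sketched end to end as follows. It makes two assumptions: the smoothing is a centered moving average over `2 * radius + 1` ranks, and the cutoff is the first rank at which the smoothed difference reaches the threshold. The function names and the kinked test curve are hypothetical.

```python
import numpy as np
import pandas as pd

def smoothed_first_difference(sorted_ratios, radius):
    """First differences of the sorted ratios, averaged over 2 * radius + 1 ranks."""
    diffs = pd.Series(np.diff(np.asarray(sorted_ratios, dtype=float)))
    return diffs.rolling(2 * radius + 1, center=True, min_periods=1).mean()

def first_rank_at_or_above(values, threshold):
    """Return the first rank where values reach threshold, or None if never."""
    hit = np.asarray(values) >= threshold
    return int(np.argmax(hit)) if hit.any() else None

# Synthetic kinked curve: slope 1e-6 up to rank 4000, then slope 1e-4.
ranks = np.arange(8000)
ratios = np.where(ranks < 4000, 1e-6 * ranks, 4e-3 + 1e-4 * (ranks - 4000))

smooth = smoothed_first_difference(ratios, radius=100)
for threshold in (0.00001, 0.00004, 0.00007):
    print(threshold, first_rank_at_or_above(smooth, threshold))
```

Smoothing spreads the kink over about one window width, so small thresholds fire slightly before the true kink and larger ones slightly after it.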

`kept_skills_by_us_diffs_cutoff[::5]`

We see the same familiar pattern we got with our other methods. Cutoffs smaller than ~4000 still contain a fair number of foreign language words, but the number decreases substantially and drops to zero by the time we get to the 4000-4500 range. Observing this pattern emerge from yet another method should give us further confidence in our final conclusion: a cutoff of approximately 4500 is a reasonable one, well supported by the data.

In the end, we identified English-language skills using the rank cutoff of 4477 that we got from the local non-English fraction approach. It worked well for our purposes in that we no longer saw a substantial number of foreign language skills in the clusters that we examined and delivered to clients.

I think this is a cool example to think about because it illustrates how we can use a fairly limited amount of information (the USGR, and a pretty noisy off-the-shelf language classifier) to solve an important practical problem. In this case, it just required some creativity in the way we combined the information at our disposal, as well as poking around the data to see what we could find. That approach spared us from annotating data and training our own classifier (very time-consuming) or carefully hand-crafting rules to detect the main foreign-language skills in our data (less time-consuming but still involving a lot of human time).