This site outlines some of the work my group did for our CSE291 project. We took the popular dimensionality reduction technique UMAP and set out to see whether we could incorporate categorical as well as numeric data. Mixed feature types come up often in data analysis and machine learning, and they are hard to handle because it is non-trivial to define a useful distance metric between observations that combine both kinds of features.
Here, we use a modified version of Gower distance (Gower, J.C. 1971) as implemented by the daisy function in the R cluster package.
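To make the idea concrete, here is a minimal sketch of plain Gower distance in Python with numpy. It is not the modified variant that daisy computes and not the code we ran: it assumes complete data, equal feature weights, and no ordinal or asymmetric-binary handling. The function name `gower_matrix` and the toy rows are hypothetical, chosen only for illustration.

```python
import numpy as np

def gower_matrix(numeric, categorical):
    """Plain Gower distance: mean of per-feature dissimilarities in [0, 1].

    Assumes no missing values and equal weights (an illustrative simplification).
    """
    numeric = np.asarray(numeric, dtype=float)
    categorical = np.asarray(categorical)
    n = numeric.shape[0]
    total = np.zeros((n, n))

    # Numeric features: absolute difference scaled by the feature's range.
    ranges = numeric.max(axis=0) - numeric.min(axis=0)
    ranges[ranges == 0] = 1.0  # a constant column contributes zero distance
    for k in range(numeric.shape[1]):
        col = numeric[:, k]
        total += np.abs(col[:, None] - col[None, :]) / ranges[k]

    # Categorical features: 0 if the two labels match, 1 otherwise.
    for k in range(categorical.shape[1]):
        codes = np.unique(categorical[:, k], return_inverse=True)[1]
        total += (codes[:, None] != codes[None, :]).astype(float)

    return total / (numeric.shape[1] + categorical.shape[1])

# Toy rows (made-up values): two numeric features and one categorical feature.
toy_numeric = [[45, 49], [60, 62], [80, 82]]
toy_categorical = [["Grass"], ["Grass"], ["Fire"]]
D = gower_matrix(toy_numeric, toy_categorical)
```

The result is a symmetric matrix with a zero diagonal and entries between 0 and 1, which is the shape of input a precomputed-metric embedding expects.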
The dataset we chose to use is the Pokemon dataset found here. For our Gower analysis, we dropped the Name, Number, and Generation columns.
We ran the Python implementation of UMAP on this data, passing metric="precomputed" so that UMAP takes the Gower distance matrix as its input.
Packages used:
- pandas
- numpy
- seaborn
- matplotlib
- bokeh
- PIL
- umap
Results can be found here.