High-Dimensional Data Representations and Metrics for Machine Learning and Data Mining

Radovanović Miloš

Please use this identifier to cite or link to this item: https://open.uns.ac.rs/handle/123456789/27732

Title:	High-Dimensional Data Representations and Metrics for Machine Learning and Data Mining (Serbian title: Reprezentacije i metrike za mašinsko učenje i analizu podataka velikih dimenzija)
Authors:	Radovanović Miloš
Keywords:	Machine learning, data mining, information retrieval, text categorization, curse of dimensionality, concentration, nearest neighbors, classification, semi-supervised learning, clustering, time series, vector space model
Issue Date:	11-Feb-2011
Publisher:	University of Novi Sad, Faculty of Sciences (Univerzitet u Novom Sadu, Prirodno-matematički fakultet)
Abstract:	In the current information age, massive amounts of data are gathered, at a rate prohibiting their effective structuring, analysis, and conversion into useful knowledge. This information overload is manifested both in large numbers of data objects recorded in data sets, and in large numbers of attributes, also known as high dimensionality. This dissertation deals with problems originating from high dimensionality of data representation, referred to as the “curse of dimensionality,” in the context of machine learning, data mining, and information retrieval. The described research follows two angles: studying the behavior of (dis)similarity metrics with increasing dimensionality, and exploring feature-selection methods, primarily with regard to document representation schemes for text classification. The main results of the dissertation, relevant to the first research angle, include theoretical insights into the concentration behavior of cosine similarity, and a detailed analysis of the phenomenon of hubness, which refers to the tendency of some points in a data set to become hubs by being included in unexpectedly many k-nearest neighbor lists of other points. The mechanisms behind the phenomenon are studied in detail, from both a theoretical and an empirical perspective, linking hubness with the (intrinsic) dimensionality of data, describing its interaction with the cluster structure of data and the information provided by class labels, and demonstrating the interplay of the phenomenon and well-known algorithms for classification, semi-supervised learning, clustering, and outlier detection, with special consideration given to time-series classification and information retrieval.
Results pertaining to the second research angle include quantification of the interaction between various transformations of high-dimensional document representations and feature selection, in the context of text classification.
URI:	https://open.uns.ac.rs/handle/123456789/27732
Appears in Collections:	PMF Teze/Theses
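
The notion of hubness described in the abstract can be illustrated with a small sketch. The code below is not taken from the dissertation: the function names (`k_occurrences`, `skewness`, `hubness_demo`), the use of i.i.d. Gaussian data, Euclidean distance, and all parameter values are assumptions chosen only for illustration. It counts, for each point, how many k-nearest neighbor lists of other points include it, and summarizes the resulting distribution by its skewness, where a long right tail suggests the presence of hubs.

```python
import random


def k_occurrences(points, k):
    """For each point, count how many other points list it among their k nearest neighbors."""
    n = len(points)
    counts = [0] * n
    for i, p in enumerate(points):
        # Squared Euclidean distance from point i to every other point
        # (squaring does not change the neighbor ranking).
        dists = [
            (sum((a - b) ** 2 for a, b in zip(p, q)), j)
            for j, q in enumerate(points)
            if j != i
        ]
        dists.sort()
        for _, j in dists[:k]:
            counts[j] += 1
    return counts


def skewness(values):
    """Standardized third moment; positive values indicate a long right tail."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    if var == 0:
        return 0.0
    return sum((v - mean) ** 3 for v in values) / n / var ** 1.5


def hubness_demo(n_points=200, k=5, dims=(3, 100), seed=0):
    """Illustrative comparison (assumed setup): i.i.d. Gaussian data at several dimensionalities."""
    rng = random.Random(seed)
    result = {}
    for d in dims:
        points = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n_points)]
        counts = k_occurrences(points, k)
        # Report the largest k-occurrence count and the skewness of the distribution.
        result[d] = (max(counts), skewness(counts))
    return result


if __name__ == "__main__":
    for d, (max_count, skew) in hubness_demo().items():
        print(f"d={d}: max k-occurrences={max_count}, skewness={skew:.2f}")
```

Every point contributes exactly k entries to the neighbor lists, so the mean k-occurrence count is always k; what may change with dimensionality is how unevenly those entries are distributed across points.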
