Creating Capsule Wardrobes from Fashion Images
Wei-Lin Hsiao
UT-Austin
Kristen Grauman
UT-Austin
kimhsiao@cs.utexas.edu
grauman@cs.utexas.edu
Abstract
We propose to automatically create capsule wardrobes.
Given an inventory of candidate garments and accessories,
the algorithm must assemble a minimal set of items that pro-
vides maximal mix-and-match outfits. We pose the task as
a subset selection problem. To permit efficient subset selec-
tion over the space of all outfit combinations, we develop
submodular objective functions capturing the key ingredi-
ents of visual compatibility, versatility, and user-specific
preference. Since adding garments to a capsule only ex-
pands its possible outfits, we devise an iterative approach
to allow near-optimal submodular function maximization.
Finally, we present an unsupervised approach to learn vi-
sual compatibility from “in the wild” full body outfit pho-
tos; the compatibility metric translates well to cleaner cat-
alog photos and improves over existing methods. Our re-
sults on thousands of pieces from popular fashion websites
show that automatic capsule creation has potential to mimic
skilled fashionistas in assembling flexible wardrobes, while
being significantly more scalable.
1. Introduction
The fashion domain is a magnet for computer vision.
New vision problems are emerging in step with the fash-
ion industry’s rapid evolution towards an online, social, and
personalized business. Style models [22, 38, 34, 17, 27],
trend forecasting [1], interactive search [44, 24], and rec-
ommendation [41, 18, 31] all require visual understanding
with rich detail and subtlety. Research in this area is poised
to have great influence on what people buy, how they shop,
and how the fashion industry analyzes its enterprise.
A capsule wardrobe is a set of garments that can be as-
sembled into many visually compatible outfits (see Fig. 1).
Capsules are currently the purview of fashion experts, mag-
azine editors, and bloggers. A capsule creator manually an-
alyzes an inventory to puzzle together a relatively small set
of clothing items and accessories that mix and match well.
Figure 1: A capsule wardrobe is a minimal set of garments that can mix and match to compose many visually compatible outfits.

A curated capsule can help consumers get the best value for their dollars (“do more with less”), help vendors propose appealing wardrobes from their catalogs, and help a
subscription service (e.g., StitchFix, LeTote) ship a targeted
box that amplifies a customer’s wardrobe.
We propose to automate capsule wardrobe generation.
There are two key technical challenges. First, capsules
hinge on having an accurate model of visual compatibil-
ity. Whereas visual similarity asks “what looks like this?”,
and is fairly well understood [11, 32, 42, 30], compatibil-
ity instead asks “what complements this?” It requires cap-
turing how multiple visual items interact, often according
to subtle visual properties. Existing compatibility meth-
ods [41, 14, 35, 16, 31, 39] assume supervision via labels or
co-purchase data, which limits scope and precision, as we
will discuss in Sec. 3. Furthermore, they largely cater only
to pairwise compatibility. The second challenge is that cap-
sule generation is a complex combinatorial problem. Of all
possible garments, we seek the subset that maximizes ver-
satility and compatibility, and, critically, the addition of any
one garment introduces multiple new outfit combinations.
We introduce an approach to automatically create cap-
sule wardrobes that addresses both of these issues. We first
cast capsule creation as a subset selection problem. We de-
fine an objective function characterizing a capsule based on
its pieces’ mutual compatibility, the resulting outfits’ versa-
tility, and (optionally) its faithfulness to a user’s preferred
style. Then, we develop an efficient algorithm that maps a
large inventory of candidate garments into the best capsule
of the desired size. We design objectives that are submod-
ular for the addition of new outfits, ensuring the “diminish-
ing returns” property that facilitates near-optimal set selec-
tion [36, 28]. Then, since each garment added to a capsule
expands the possible outfits, we further develop an iterative
approach that exploits outfit submodularity to alternate be-
tween fixing and selecting each layer of clothing.
As a second main contribution, we introduce an unsuper-
vised approach to learn visual compatibility from full-body
images “in the wild”. We learn a generative model for outfit
compositions from unlabeled images that can score k-way
compatibility. Because it is built on predicted attributes, our
model can translate compatibility learned from the “in the
wild” photos to cleaner catalog photos of individual items,
where users need most guidance on mixing and matching.
We evaluate our approach on thousands of garments
from Polyvore, a popular social commerce website for fash-
ion. We compare our algorithm’s capsule creations to those
manually defined by fashionistas, as well as subjective user
studies. Furthermore, we show our underlying compatibil-
ity model offers advantages over some state of the art meth-
ods. Finally, we demonstrate the practical value of our al-
gorithm, which in seconds finds near-optimal capsules for
problem scales that are otherwise intractable.
2. Related Work
Attributes for fashion Attributes offer a natural repre-
sentation for clothing, since they can describe relevant pat-
terns (checked, paisley), colors (rose, teal), fit (loose), and
cut (V-neck, flowing) [4, 2, 8, 43, 6, 23, 33]. Topic models
on attributes are indicative of styles [20, 40, 17]. Inspired
by [17], we employ topic models. However, whereas [17]
seeks a style-coherent image embedding, we use correlated
topic models to score novel combinations of garments for
their compatibility. Domain adaptation [6, 19] and multi-
task curriculum learning [9] are valuable to overcome the
gap between street and shop photos. We devise a simple
curriculum learning approach to train attributes effectively
in our setting. None of the above methods explore visual
compatibility or capsule wardrobes.
Style and fashionability Beyond recognition tasks,
fashion also demands answering: How do we represent
style? What makes an outfit fashionable? The style of an
outfit is typically learned in a supervised manner. Lever-
aging style-labeled data like HipsterWars [22] or Deep-
Fashion [33], classifiers built on body keypoints [22],
weak meta-data [38], or contextual embeddings [27] show
promise. Fashionability refers specifically to a style’s pop-
ularity. It can also be learned from supervised data, e.g.,
online data for user “likes” [29, 37]. Unsupervised style
discovery methods instead mine unlabeled photos to detect
common themes in people’s outfits, with topic models [17],
non-negative matrix factorization [1], or clustering [34]. We
also leverage unlabeled images to discover “what people
wear”; however, our goal is to infer visual compatibility for
unseen garments, rather than trend analysis [1, 34] or image
retrieval [17] on a fixed corpus.
Compatibility and recommendation Substantial prior
work explores ways to link images containing the same or
very similar garment [11, 32, 42, 21, 30]. In contrast, com-
patibility requires judging how well-coordinated or comple-
mentary a given set of garments is. Compatibility can be
posed as a metric learning problem [41, 35, 16], addressable
with Siamese embeddings [41] or link prediction [35]. Text
data can aid compatibility [29, 39, 14]. As an alternative
to metric learning, a recurrent neural network models outfit
composition as a sequential process that adds one garment
at a time, implicitly learning compatibility via the transition
function [14]. Compatibility has applications in recommen-
dation [31, 18], but prior work recommends a garment at a
time, as opposed to constructing a wardrobe.
To our knowledge, all prior work requires labeled data
to learn compatibility, whether from human annotators cu-
rating matches [14, 18], co-purchase data [41, 35, 16], or
implicit crowd labels [29]. In contrast, we propose an un-
supervised approach, which has the advantages of scalabil-
ity, privacy, and continually refreshable models as fashion
evolves, and also avoids awkwardly generating “negative”
training pairs (see Sec. 3). Most importantly, our work
is the first to develop an algorithm for generating capsule
wardrobes. Capsules require going beyond pairwise com-
patibility to represent k-way interactions and versatility, and
they present a challenging combinatorial problem.
Subset selection We pose capsule wardrobe generation
as a subset selection problem. Probabilistic determinantal
point processes (DPP) can identify the subset of items that
maximize individual item “quality” while also maximizing
total “diversity” of the set [25], and have been applied for
document and video summarization [25, 12]. Alternatively,
submodular function maximization exploits “diminishing
returns” to select an optimal subset subject to a budget [36].
For submodular objectives, an efficient greedy selection cri-
terion is near optimal [36], e.g., as exploited for sensor
placement [13] and outbreak detection [28]. We show how
to adapt such solutions to permit accurate and efficient se-
lection for capsule wardrobes; furthermore, we develop an
iterative EM-like algorithm to enable non-submodular ob-
jectives for mix-and-match outfits.
3. Approach
We first formally define the capsule wardrobe problem
and introduce our approach (Sec. 3.1). Then in Sec. 3.2
we present our unsupervised approach to learn compatibil-
ity and personalized styles, two key ingredients in capsule
wardrobes. Finally, in Sec. 3.3, we overview our training
procedure for cross-domain attribute recognition.
3.1. Subset selection for capsule wardrobes
A capsule wardrobe is a minimal set of garments that combine in versatile ways to create many compatible outfits (see Fig. 1). We cast capsule creation as the problem of selecting a subset from a large set of candidates that maximizes quality (compatibility) and diversity (versatility).

3.1.1 Problem formulation and objective

We formulate the subset selection problem as follows. Let i = 0, . . . , (m − 1) index the m layers of clothing (e.g., outerwear, upper body, lower body, hosiery). Let A_i = {s_i^0, s_i^1, ..., s_i^(N_i−1)} denote the set of candidate garments/pieces in layer i, where s_i^j, j = 0, . . . , (N_i − 1), is the j-th piece in layer i, and N_i is the number of candidate pieces for that layer. For example, the candidates could be the inventory of a given catalog. If an outfit is composed of one and only one piece from each layer, the candidate pieces in total could generate a set Y of ∏_i N_i possible outfits.

Objective To form a capsule wardrobe, we must select only T pieces, A_iT = {s_i^(j_1), ..., s_i^(j_T)} ⊆ A_i, from each layer i. The set of outfits y generated by these pieces consists of A_0T × A_1T × . . . × A_(m−1)T. Our goal is to select the pieces A*_iT, ∀i, such that their composed set of outfits y* is maximally compatible and versatile. Fig. 2 visualizes this problem.
To this end, we define our objective as:
y* = argmax_(y ⊆ Y)  C(y) + V(y),        (1)
s.t.  y = A_0T × A_1T × . . . × A_(m−1)T
where C(y) and V (y) denote the compatibility and versa-
tility scores, respectively.
A naïve approach to find the optimal solution y* requires computation on T^m outfits in a subset, multiplying by (N choose T)^m to search through all possible subsets. Since our
candidate pool may consist of all merchandise in a shopping
site, N may be on the order of hundreds or thousands, so op-
timal solutions become intractable. Fortunately, our key in-
sight is that as wardrobes expand, subsequent outfits add di-
minishing amounts of new styles/looks. This permits a sub-
modular objective that allows us to obtain a near-optimal
solution efficiently. In particular, greedily growing a set for
subset selection is near-optimal if the objective function is
submodular; the greedy algorithm is guaranteed to reach a solution achieving at least a constant fraction 1 − 1/e, or about 63%, of the optimal score [36].

Figure 2: Selecting a subset of pieces from the candidates to form a capsule wardrobe. Left shows candidates from all layers. The selected pieces are checked and compose the subset on the right.
Definition 3.1 Submodularity. A set function F is submod-
ular if, ∀D ⊆ B ⊆ V , ∀s ∈ V \ B, F (D ∪{s})− F (D) ≥
F (B ∪ {s}) − F (B).
Submodularity satisfies diminishing returns. Since it is
closed under nonnegative linear combinations, if we design
C(y) and V (y) to be submodular, our final objective will
be submodular as well, as we show next.
We stress that items in a subset y are outfits—not gar-
ments/pieces. The capsule is the Cartesian product of all
garments selected per layer. Therefore, when greedily growing a set at time step t, an incremental addition of one garment s_i^(j_t) from layer i entails adding s_i^(j_t) × ∏_(i′≠i) A_i′(t−1) new outfits. While ultimately the algorithm must add garments to the capsule, for the sake of optimization, it needs to reason about the set in terms of those garments’ combinatorial outfits. We address this challenge below.
Compatibility Suppose we have an algorithm that re-
turns the compatibility estimate c(oj) of outfit oj (to be de-
fined in Sec. 3.2). We define a set’s compatibility score as:
C(y) := Σ_(o_j ∈ y) c(o_j),        (2)
the sum of compatibility scores for all its outfits. C(y) is
modular, a special case of submodularity, since an addi-
tional outfit oj to any set y will increase C(y) by the same
amount c(oj).
Versatility A good capsule wardrobe should offer a va-
riety of looks, or styles, for different uses and occasions. We
formalize versatility as a coverage function over all styles:
V(y) := Σ_(i=1)^K v_y(z_i)        (3)
where vy(zi) measures the degree to which outfits in y
cover the i-th desired style zi, and K is the total number
of distinct styles. Sec. 3.2 will define our model for styles
zi. To satisfy the diminishing returns property, we define
v_y(z_i) probabilistically:

v_y(z_i) := 1 − ∏_(o_j ∈ y) (1 − P(z_i|o_j)),        (4)
where P (zi|oj) denotes the probability of a style zi given
an outfit oj. We define a generative model for P (zi|oj) be-
low. The idea is that each outfit “tries” to cover a style with
probability P (zi|oj), and the style is covered by a capsule
if at least one of the outfits in it successfully covers that
style. Thus, as y expands, subsequent outfits add diminish-
ing amounts of style coverage. A probabilistic expression
for coverage is also used in [10] for blog posts.
We have thus far defined versatility in terms of uniform
coverage over all K styles. However, each user has his/her
own preferences, and a universally versatile capsule may
contain pieces that do not meet one’s taste. Thus, as a per-
sonalized variant of our approach, we adjust each style’s
proportion in the coverage function by a user’s style prefer-
ence. This extends our capsules with personalized versatil-
ity:
V(y) := Σ_(i=1)^K w_i v_y(z_i),        (5)
where wi denotes a personalized preference for each style
i. Sec. 3.2 explains how the personalization weights are
discovered from user data.
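For concreteness, the full objective in Eqns (1)–(5) can be evaluated as in the following sketch, assuming the per-outfit compatibility scores c(o_j) and style posteriors P(z_i|o_j) have already been computed by the model of Sec. 3.2; the array layout and function name are illustrative rather than our actual implementation.

import numpy as np

def capsule_objective(outfit_compat, outfit_styles, style_weights=None):
    """Score a set of outfits y: obj(y) = C(y) + V(y).

    outfit_compat : (n,) array of c(o_j) for the n outfits in y           (Eqn 2)
    outfit_styles : (n, K) array of P(z_i | o_j)                          (Eqns 4, 8)
    style_weights : optional (K,) array of user preferences w_i           (Eqn 5)
    """
    compatibility = outfit_compat.sum()                     # modular term C(y)
    # Probabilistic coverage: style z_i is covered if at least one outfit covers it.
    coverage = 1.0 - np.prod(1.0 - outfit_styles, axis=0)   # v_y(z_i), shape (K,)
    if style_weights is None:
        versatility = coverage.sum()                         # uniform V(y), Eqn 3
    else:
        versatility = (style_weights * coverage).sum()       # personalized V(y), Eqn 5
    return compatibility + versatility

Because each additional outfit can only shrink the product inside the coverage term, the gain in v_y(z_i) diminishes as y grows, which is the submodularity exploited next.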
3.1.2 Optimization
A key challenge of subset selection for capsule wardrobes
is that our subsets are on outfits, but we must form the sub-
set by selecting garments. With each garment addition, the
subset of outfits y grows superlinearly, since every new gar-
ment can combine with all previous garments to form new
outfits. Submodularity requires each addition to diminish a
set function’s gain, but adding more garments yields more
outfits, so the gain actually increases. Thus, while our ob-
jective is submodular for adding outfits, it is not submodular
for adding individual garments. However, we can make the
following claim:
Claim 3.2 When fixing all other layers (i.e., upper, lower,
outer) and selecting a subset of pieces one layer at a time,
the probabilistic versatility coverage function in Eqn (3) is
submodular, and the compatibility function in Eqn (2) is
modular. See Supplementary File for proof.
Thus, given a single layer, our objective function is sub-
modular for garments. By fixing all selected pieces in other
layers, any additional garment will be combined with the
same set of garments and form the same number of new
outfits. Thus subsets in a given layer no longer grow super-
linearly. So the guarantee of a greedy solution on that layer
being near-optimal [36] still holds.
To exploit this, we develop an EM-like iterative approach
to approximate a greedy solution over all layers: we itera-
tively fix the subsets selected in other layers, and focus the
current selection in a single layer. After sufficient iterations,
our subsets converge to a fixed set. Algorithm 1 gives the
complete steps.
Algorithm 1 Proposed iterative greedy algorithm for submodular maximization, where obj(y) := C(y) + V(y).
 1: A_iT := ∅, ∀i
 2: ∆obj := ε + 1                              ▷ ε is the tolerance degree for convergence
 3: obj^(m−1)_prev := 0
 4: while ∆^(m−1)_obj ≥ ε do
 5:     for each layer i = 0, 1, . . . , (m − 1) do
 6:         A_iT = A_i0 := ∅                   ▷ Reset selected pieces in layer i
 7:         obj^i_cur := 0
 8:         for each time step t = 1, 2, . . . , T do
 9:             s^(j_t)_i := argmax_(s ∈ A_i \ A_i(t−1)) δ_s        ▷ Max increment
10:                 where δ_s = obj(y_(t−1) ⊕ s) − obj(y_(t−1)),
11:                 y_(t−1) = A_i(t−1) × ∏_(i′≠i) A_i′T
12:             A_it := s^(j_t)_i ∪ A_i(t−1)                        ▷ Update layer i
13:             obj^i_cur := obj^i_cur + δ_(s^(j_t)_i)
14:         end for
15:     end for
16:     ∆^(m−1)_obj := obj^(m−1)_cur − obj^(m−1)_prev
17:     obj^(m−1)_prev := obj^(m−1)_cur
18: end while
19: procedure INCREMENTAL ADDITION (y_t := y_(t−1) ⊕ s)
20:     y^+_t := s, s ∈ A_i \ A_i(t−1)
21:     for j ∈ {1, . . . , m}, j ≠ i do
22:         if A_jT ≠ ∅ then
23:             y^+_t := y^+_t × A_jT
24:         end if
25:     end for
26:     y_t := y_(t−1) ∪ y^+_t
27: end procedure

Our algorithm is quite efficient. Whereas a naïve search would take more than 1B hours for our data with N = 150, our algorithm returns an approximate capsule in only 200
seconds. Most computation is devoted to computing the ob-
jective function, which requires topic model inference (see
below). A naïve greedy approach on garments would require O(NT^4) time for m = 4 layers, while our iterative approach requires O(NT^3) time per iteration (details in Supp.). For our datasets, it requires just 5 iterations.
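As a rough sketch (not our exact implementation), the layer-wise greedy selection can be organized as follows, assuming a black-box objective(outfits) that returns C(y) + V(y) (and 0 for the empty set), hashable piece identifiers, and N > T candidates per layer; the helper and variable names are illustrative.

from itertools import product

def capsule_outfits(per_layer):
    """Cartesian product of the selected pieces, skipping layers that are still empty."""
    nonempty = [layer for layer in per_layer if layer]
    return list(product(*nonempty)) if nonempty else []

def iterative_greedy(candidates, T, objective, eps=1e-4, max_iters=10):
    """Iterative per-layer greedy maximization, in the spirit of Algorithm 1.

    candidates : list of m lists of candidate piece ids, one list per layer
    objective  : callable mapping a list of outfits to C(y) + V(y)
    """
    m = len(candidates)
    selected = [[] for _ in range(m)]                  # A_iT for each layer i
    prev = float("-inf")
    for _ in range(max_iters):                         # outer loop until convergence
        for i in range(m):                             # fix other layers, redo layer i
            selected[i] = []
            for _ in range(T):                         # greedily add T pieces
                base = objective(capsule_outfits(selected))
                best_gain, best_piece = float("-inf"), None
                for s in candidates[i]:
                    if s in selected[i]:
                        continue
                    trial = [sel if j != i else sel + [s]
                             for j, sel in enumerate(selected)]
                    gain = objective(capsule_outfits(trial)) - base
                    if gain > best_gain:
                        best_gain, best_piece = gain, s
                selected[i].append(best_piece)         # garment with max increment
        cur = objective(capsule_outfits(selected))
        if cur - prev < eps:                           # objective stopped improving
            break
        prev = cur
    return selected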
3.2. Style topic models for compatibility
Having defined the capsule selection objective and opti-
mization, we now present our approach to model versatility
(via P (zi|oj)) and compatibility c(oj) simultaneously.
Prior work on compatibility takes a supervised approach.
Given ground truth compatible items (either by manual la-
bels like curated sets of product images on Polyvore [29, 39,
14] or by using Amazon co-purchase data as a proxy [35,
41, 16]), a (usually discriminative) compatibility metric is
trained. See Fig. 3. However, the supervised strategy has
weaknesses. First, items purchased at the same time can be
a weak proxy for visual compatibility. Second, user-created
sets often focus on the visualization of the collage, usually
contain fewer than two main (non-accessory) pieces, and
lack layers like hosiery. Third, sources like Amazon and
Polyvore are limited to brands selected by vendors, a frac-
tion of the wide variety of clothing people wear in real life.
Fourth, obtaining the negative non-compatible examples re-
Figure 3: L to R: Amazon co-purchase example; Polyvore user-
curated set; Chictopia full-body outfit (our training source).
quired by supervised discriminative methods is problem-
atic. Previous work [35, 41, 16, 29, 39] generates negative
examples by randomly swapping items in positive pairs, but
there is no guarantee that the random combinations are true
negatives. Not observing a pair of items together does not
necessarily mean they do not go well together.
To address these issues, we propose a generative compat-
ibility model that is learned from unlabeled images of peo-
ple wearing outfits “in the wild” (Fig. 3, right). We explore a
topic model—namely Correlated Topic Models (CTM) [26]
—from text analysis. CTM is a Bayesian multinomial mix-
ture model that supposes a small number of K latent topics
account for the distribution of observed words in any given
document.
It uses the following generative process for a
corpus D consisting of M documents each of length Li:
1. Choose η_i ∼ N(µ, Σ), where i ∈ {1, . . . , M} and µ, Σ are a K-dimensional mean and covariance matrix. θ_ik = e^(η_ik) / Σ_(k=1)^K e^(η_ik) maps η_i to a simplex.
2. Choose ϕ_k ∼ Dir(β), where k ∈ {1, . . . , K} and Dir(β) is the Dirichlet distribution with parameter β.
3. For each word indexed by (i, j), where j ∈ {1, . . . , L_i} and i ∈ {1, . . . , M}:
(a) Choose a topic z_(i,j) ∼ Multinomial(θ_i).
(b) Choose a word x_(i,j) ∼ Multinomial(ϕ_(z_(i,j))).
Only the word occurrences are observed. Following [17],
we map textual topic models to visual ones: a “document”
is an outfit, a “word” is an inferred visual attribute (e.g., flo-
ral, chiffon), and a “topic” is a style. The model discovers
the compositions of visual cues (i.e., attributes) that char-
acterize styles. A topic might capture plaid blue blouses,
or tight leather skirts. Prior work [17] models styles with a
Dirichlet prior, which treats topics within an image as inde-
pendent. For compatibility, we find CTM’s logistic normal
prior above [26] beneficial to account for style correlations
(e.g., a formal blazer is more likely to be combined with a
skirt than sporty leggings.)
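To make the generative process concrete, the following sketch samples one outfit “document” of attribute “words” from given CTM parameters; µ, Σ, β, and the sizes used here are illustrative placeholders, not estimates from our data.

import numpy as np

rng = np.random.default_rng(0)
K, V, L = 30, 195, 12                     # styles (topics), attribute vocabulary size, attributes per outfit

mu, Sigma = np.zeros(K), np.eye(K)        # logistic-normal prior over style activations
beta = np.full(V, 0.1)
phi = rng.dirichlet(beta, size=K)         # step 2: per-style attribute distributions phi_k

eta = rng.multivariate_normal(mu, Sigma)  # step 1: correlated style activations eta_i
theta = np.exp(eta) / np.exp(eta).sum()   # map eta_i onto the simplex

attributes = []
for _ in range(L):                        # step 3: draw each attribute "word"
    z = rng.choice(K, p=theta)            # 3(a): choose a style z_{i,j}
    x = rng.choice(V, p=phi[z])           # 3(b): choose an attribute x_{i,j} from that style
    attributes.append(x)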
CTM estimates the latent variables by maximizing the posterior distribution given a corpus D:

p(θ, z | D, µ, Σ, β) = ∏_(i=1)^M p(θ_i|µ, Σ) ∏_(j=1)^(L_i) p(z_ij|θ_i) p(x_ij|z_ij, β).        (6)
First we find the latent variables that fit assembled outfits
on full-body images. Next, given an arbitrary combination
of catalog pieces, we predict their attributes, and take the
union of attributes on all pieces to form an outfit oj. Finally
we infer its compatibility by the likelihood:
c(o_j) := p(o_j|µ, Σ, β).        (7)
Combinations similar to previously assembled outfits D
will have higher probability. For this generative model, no
negative examples need to be contrived. The training pool
should be outfits like those we want the model to emulate;
we use full-body photos posted to a fashion website.
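This pipeline can be sketched as below; we use scikit-learn’s LatentDirichletAllocation purely as a stand-in for the CTM described above (CTM additionally captures topic correlations through its logistic normal prior), and the placeholder data, sizes, and helper names are illustrative rather than our actual implementation.

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Rows are outfit "documents", columns are the 195 attribute "words";
# entry (d, a) counts how often attribute a was predicted for outfit d.
# Random placeholder standing in for attributes predicted on unlabeled outfit photos.
X_train = np.random.randint(0, 2, size=(500, 195))

topic_model = LatentDirichletAllocation(n_components=30, random_state=0).fit(X_train)

def outfit_document(piece_attribute_vectors):
    """Union (sum) of per-piece attribute predictions -> one outfit document."""
    return np.sum(piece_attribute_vectors, axis=0, keepdims=True)

def compatibility(piece_attribute_vectors):
    """c(o_j): (approximate) log-likelihood of the outfit under the style model (Eqn 7)."""
    return topic_model.score(outfit_document(piece_attribute_vectors))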
Given a database of unlabeled outfit images, we pre-
dict their attributes. Then we apply CTM to obtain a set
of styles, where each style k is an attribute distribution
ϕk. Given an outfit oj, CTM infers its style composition
θ_(o_j) = [θ_(o_j),1, . . . , θ_(o_j),K]. Then we have:

P(z_i|o_j) := P(z_i|θ_(o_j)) = θ_(o_j),i,        (8)
which is used to compute our versatility coverage in Eq (4).
We personalize the styles emphasized for a user in Eq (5) as follows. Given a collection of outfits {o_(j_1), . . . , o_(j_U)} owned by a user p, e.g., as shown in that user’s purchase history catalog photos or his/her album on a fashion website, we learn his/her style preference θ_p^(user) by aggregating all outfits: θ_p^(user) = (1/U) Σ_j θ_(o_j). Hence, a user’s personalized weight w_i for each style i is θ_(p,i)^(user).
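Continuing the sketch, the style posteriors of Eq (8) and the personalized weights of Eq (5) reduce to a mean of inferred style mixtures; topic_model here is assumed to be any fitted model exposing a transform() that returns per-document style proportions, mirroring the stand-in above.

import numpy as np

def style_mixture(outfit_document, topic_model):
    """theta_{o_j}: style composition of one outfit, also P(z_i | o_j) in Eqn (8)."""
    return topic_model.transform(outfit_document)[0]

def personalization_weights(user_outfit_documents, topic_model):
    """w_i = theta^(user)_{p,i} = (1/U) * sum_j theta_{o_j, i}."""
    thetas = np.vstack([style_mixture(doc, topic_model) for doc in user_outfit_documents])
    return thetas.mean(axis=0)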
Unlike previous work that uses supervision from human
created matches or co-purchase information, our model is
fully unsupervised. While we train attribute models on a
disjoint pool of attribute labeled images, our topic model
runs on “inferred” attributes, and annotators do not touch
the images from which we learn compatibility.
3.3. Cross-domain attribute recognition
Finally, we describe our approach to infer cross-domain
attributes on both catalog and outfit images.
Vocabulary and data collection Since fine-grained at-
tributes are valuable for style, we build our vocabulary on
the 195 attributes enumerated in [17]. Their dataset has at-
tributes labeled only on outfit images, so we collect cata-
log images labeled with the same attribute vocabulary. To
gather images, we use keyword search with the attribute
name on Google, then manually prune those where the at-
tribute is absent. This yields 100 to 300 positive training
images per attribute, in total 12K images. (Details in Supp.)
Curriculum learning from shop to street Catalog im-
ages usually have a clear background, and are free of tricky
lighting conditions and deformations from body pose. As
a result, attributes are more readily recognizable in cat-
alog images than full-body outfit images. We propose
a two stage curriculum learning approach for cross do-
main attribute recognition.
Figure 4: Qualitative examples of most (left) and least (right) compatible outfits as scored by our model. Here we show both outfits with 2 pieces (top) and 3 pieces (bottom).

In the first stage, we finetune a deep neural network, ResNet-50 [15] pretrained on ImageNet [7], for catalog attribute recognition. In the second
stage, we first detect and crop images into upper and lower
body instances, and then we finetune the network from the
first stage for outfit attribute recognition. Evaluating on the
validation split from the 19K image dataset [17], we see
a significant 15% mAP improvement, especially on chal-
lenging attributes such as material and neckline, which are
subtle or occupy a small region on outfit images.
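A sketch of the two-stage curriculum with PyTorch; the optimizer, learning rates, epoch count, and the two data loaders (catalog images, and upper/lower-body crops of outfit photos, each paired with multi-hot attribute targets) are placeholders rather than our exact training recipe.

import torch
import torch.nn as nn
from torchvision import models

def finetune(model, loader, epochs=10, lr=1e-4):
    """Multi-label attribute training with a sigmoid/BCE head."""
    criterion = nn.BCEWithLogitsLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    model.train()
    for _ in range(epochs):
        for images, attribute_targets in loader:
            optimizer.zero_grad()
            loss = criterion(model(images), attribute_targets.float())
            loss.backward()
            optimizer.step()
    return model

def train_curriculum(catalog_loader, outfit_crop_loader, num_attributes=195):
    """Stage 1: clean catalog images; stage 2: cropped upper/lower-body regions
    from full-body outfit photos, initialized from the stage-1 weights."""
    model = models.resnet50(pretrained=True)                  # ImageNet initialization
    model.fc = nn.Linear(model.fc.in_features, num_attributes)
    model = finetune(model, catalog_loader, lr=1e-4)          # stage 1: shop domain
    model = finetune(model, outfit_crop_loader, lr=1e-5)      # stage 2: street domain
    return model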
4. Experiments
We first evaluate our compatibility estimation in isola-
tion (Sec. 4.1). Then, we evaluate our algorithm’s capsule
wardrobes, for both quality and efficiency (Sec. 4.2).
4.1. Compatibility
Dataset Previous work [29, 39, 14] studying com-
patibility each collects a dataset from polyvore.com.
Polyvore is a platform where fashion-conscious users create
sets of clothing pieces that go well with each other. While
these compatible sets are valuable supervision, as discussed
above, collecting “incompatible” outfits is problematic.
While our method uses no “incompatible” examples for
training, to facilitate evaluation, we devise a more reli-
able mechanism to avoid false negatives. We collect 3,759 Polyvore outfits, composed of 7,478 pieces, each with
meta-labels such as season (winter, spring, summer, fall),
occasion (work, vacation), and function (date, hike). Tab. 1
summarizes the dataset breakdown. We exploit the meta-
labels to generate incompatible outfits. For each compatible
outfit, we generate an incompatible one by randomly swap-
ping one piece to another piece in the same layer from an
exclusive meta-label. For example, each winter (work) out-
fit will swap a piece with a summer (vacation) outfit. We
use outfits that have at least 2 pieces from different layers
as positives, and for each positive outfit, we generate 5 neg-
atives. In total, our test set has 2,574 positives and 12,870
negatives. In short, by swapping with the guidance of meta-
label, the negatives are more likely to be true negatives. See
Supp. for examples.
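The meta-label–guided swapping used to build these negatives can be sketched as follows; the outfit representation (a dict of layer -> piece plus a single meta-label) and any exclusivity pairs beyond the winter/summer and work/vacation examples in the text are assumptions made for illustration.

import random

EXCLUSIVE = {"winter": "summer", "summer": "winter",
             "work": "vacation", "vacation": "work"}    # assumed exclusive meta-label pairs

def make_negative(outfit, outfits_by_label, rng=random):
    """Build a likely-true negative by swapping one piece with a piece from the
    same layer of an outfit carrying an exclusive meta-label."""
    negative = dict(outfit["pieces"])                    # layer -> piece id
    layer = rng.choice(sorted(negative))
    donors = [o for o in outfits_by_label[EXCLUSIVE[outfit["label"]]]
              if layer in o["pieces"]]
    negative[layer] = rng.choice(donors)["pieces"][layer]
    return negative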
             fall  winter  spring  summer  vacation  work  date  hike  total
total sets    307     302     307     308       731   505   791   508   3759
> 1 itm.      242     275     227     206       421   454   514   413   2752
> 2 itm.      101     130      71      60        66   177    96   146    847

Table 1: Breakdown of our Polyvore dataset.
Baselines We compare with two recent methods: i) MONOMER [16], an embedding trained using Amazon
product co-purchase info as a proxy label for compatibil-
ity. Since Monomer predicts only pairwise scores between
pieces, we average all pairwise scores to get an outfit’s total
compatibility, following [14]. ii) BILSTM [14], a sequen-
tial model trained on user-created sets from Polyvore to pre-
dict the held-out piece of a layer given pieces from other
layers. The probability of the whole sequence is its com-
patibility metric. For both baselines, we use the authors’
provided code and their same training data sources.
Implementation We collect 3,957 “in the wild” out-
fit images from chictopia.com to learn compatibility.
We apply the cross domain curriculum learning (Sec. 3.3)
to predict their attributes,
then fit the topic model [5]
(Sec. 3.2). Given an outfit consisting of Polyvore garment
product images, we predict attributes per piece then pool
them for the whole outfit, and infer the outfit’s compatibil-
ity. Since both Monomer [16] and BiLSTM [14] are de-
signed to run on per-garment catalog images, training them
on our full-body outfit dataset is not possible.
Results Fig. 5 (left) compares our compatibility to al-
ternative topic models. As one baseline, we extend the
polylingual LDA (POLYLDA) style discovery [17] to com-
pute compatibility with likelihood, analogous to our CTM-
based approach. Among the three topic model variants,
LDA performs worst. PolyLDA [17] learns styles across
all body parts, and thus an outfit with incompatible pieces
will be modeled by multiple topics, lowering its likelihood.
However, PolyLDA is still ignorant of correlation between
styles, and our model fares better.
Figure 5: Compatibility accuracy comparisons (precision-recall curves). Legend shows AP. Left: topic model variants, including PolyLDA styles [17] (Ours 0.199, PolyLDA 0.177, LDA 0.148). Right: state-of-the-art compatibility models [16, 14] (Ours 0.199, BiLSTM 0.184, Monomer 0.157).

Fig. 5 (right) compares our compatibility to Monomer [16] and BiLSTM [14]. Our model outperforms both existing techniques. Monomer assumes pairwise relations, but many outfits consist of more than two pieces; aggregating pairwise relations fails to accurately capture an
outfit’s compatibility as a whole. Like us, BiLSTM learns
compatibility from only positives, yet we perform better.
This supports our idea to learn compatibility from “in the
wild” full body images.
Fig. 4 shows qualitative results of most and least compat-
ible outfits scored by our model. Its most compatible outfits
tend to include basic pieces—so-called staples—with neu-
tral colors and low-key patterns or materials. Basic pieces
go well with almost anything, making them strong compat-
ibles. Our model picks up on the fact that people’s daily
outfits usually consist of at least one such basic piece. Out-
fits we infer to be incompatible tend to have incongruous
color palettes or mismatched fit/cut.
4.2. Capsule wardrobe creation
Having validated our compatibility module, we next
evaluate our selection algorithm for capsule wardrobes. We
consider two scenarios for likely use cases: i) Adding—given a seed outfit, optimize the pieces to add, augmenting
the starter wardrobe; and ii) Personalizing—given a set of
outfits the user likes/has worn/owns, optimize a new capsule
from scratch to meet his/her taste.
Dataset To construct the pool of candidate pieces, we
select N = 150 pieces each for the outer, upper, and lower
layers and N = 50 for the one piece layer from the 7,478
pieces in the Polyvore data. The pool size represents the
scale of a typical online clothing vendor. As seed outfits, we
use those that have pieces in all outer, upper, lower layers,
resulting in 759 seed outfits. We report results averaged
over all 759 seed initializations. We consider capsules with
T = 4 pieces in each of the m = 3 layers. This gives 12
pieces and 64 outfits per capsule.
Baselines The baselines try to find staples or prototypi-
cal pieces, while at the same time avoiding near duplicates.
Specifically: i) MMR [3]: a widely used function in infor-
mation retrieval that strikes a balance between “relevance”
and “diversity”, scoring items by λRel + (1 − λ)Div. We
use our model’s p(s|µ, Σ, β) of a single piece s to mea-
sure MMR relevance, and the visual dissimilarity between
selected pieces as diversity. ii) CLUSTER CENTERS: clus-
ters the pieces in each layer to T clusters and then selects a representative piece from each cluster and layer. We cluster with k-medoids in the 2048-D feature space from the last layer of our catalog attribute CNN.

                 Compatibility (↓)   Versatility (↑)
Cluster Center         1.16               0.55
MMR-λ0.3               3.05               3.09
MMR-λ0.5               2.95               2.85
MMR-λ0.7               2.12               2.08
naïve greedy           0.88               0.84
Iterative              0.83               0.78

Table 2: Capsules scored by human-created gold standard.

Figure 6: Two example capsules created by different methods for the same seed outfits (shown offset to left). Titles show method and compatibility/versatility scores.
Capsule creation quality First, we compare all meth-
ods by measuring how much their capsules differ from
human-curated outfits. The gold standard outfits are the
Polyvore sets. We measure the visual distance of each outfit
to its nearest neighbor in the gold standard; the smaller the
distance, the more compatible. A capsule’s compatibility is
the summed distances of its outfits. We score capsule versa-
tility piece-wise: we compute the piece-wise visual distance
per layer, and sum all distances. All distances are computed
on the 2048-D CNN features and normalized by the σ of
all distances. We stress that both metrics are independent
of our model’s learned compatibility and styles; success is
achieved by matching the human-curated gold standard, not
merely by optimizing our objectives C(y) and V (y).
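As a sketch under our reading of these metrics (2048-D CNN features per outfit or piece; “piece-wise distance per layer” interpreted as pairwise distances within a layer; the normalization by σ of all distances omitted for brevity), the two scores could be computed as:

import numpy as np
from scipy.spatial.distance import cdist

def capsule_compatibility_score(capsule_outfit_feats, gold_outfit_feats):
    """Sum, over the capsule's outfits, of the distance to the nearest
    gold-standard outfit (lower is better)."""
    dists = cdist(capsule_outfit_feats, gold_outfit_feats)    # (n_outfits, n_gold)
    return dists.min(axis=1).sum()

def capsule_versatility_score(piece_feats_by_layer):
    """Sum of pairwise visual distances among the selected pieces within each
    layer (higher is better).  piece_feats_by_layer: list of (T, 2048) arrays."""
    total = 0.0
    for feats in piece_feats_by_layer:
        pairwise = cdist(feats, feats)
        total += pairwise[np.triu_indices(len(feats), k=1)].sum()
    return total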
Tab. 2 shows the results. Our iterative-greedy is near-
est to the gold standard for compatibility, and MMR fares
worst.

               Objective   Obj./Optimal Obj. (%)   Time
Optimal           40.8             100             131.1 sec
naïve greedy      30.8              76              34.3 sec
Iterative         35.5              87              57.9 sec

Table 3: Quality vs. run-time per capsule for naïve and iterative greedy maximization, compared to the true optimal solution. Here we run at toy scale (N = 10) so that optimal is tractable. Our iterative algorithm better approximates the optimal solution, yet is much faster. Run at a realistic scale (N = 150), optimal is intractable (∼1B hours), while our algorithm takes only 200 sec.

Tuning λ in MMR as high as 0.7 to emphasize its rel-
evance term helps increase its accuracy, as it includes more
staples. However, an undesirable effect of MMR (hidden by
these numbers) is that for a given λ value, the capsules are
almost always the same, independent of the initial seed out-
fit. This is due to MMR’s diversity term being vulnerable
to outliers. Hence, while MMR outperforms our method on
versatility, it fails to create capsules coherent with the seed
wardrobe (see Fig. 6, bottom right). Clustering has accept-
able compatibility, but low versatility (see Fig. 6, top right).
Personalized capsules Next we demonstrate our ap-
proach to tailor a capsule for a user’s taste. As a proof of
concept, we select two users from chictopia.com, and
use 200 photos in their albums to learn the users’ style pref-
erence (see end of Sec. 3.2). All 7,478 pieces are treated as
candidates. Fig. 7 shows the personalized capsules gener-
ated by our algorithm. User 1’s album suggests she prefers
lady-like looks, and accordingly our algorithm creates a
capsule with pastel colors and chiffon material. In contrast,
user 2 prefers street, punk looks, and our algorithm creates
a capsule with denim, leather material, and dark colors. See
Supp. for a comparison to nearest neighbor image retrieval.
Iterative submodular vs. naïve greedy algorithm Next, we compare our iterative greedy algorithm—which properly accounts for the superlinear growth of capsules as garments are added—to a baseline greedy algorithm that naïvely employs submodular function maximization, ignor-
ing the combinatorics of introducing each new garment. To
verify that our iterative approach better approximates the
optimal solution in practice, we create a toy experiment
with N = 10 candidates and T = 3 selections per layer.
We stress the scale of this experiment is limited only by the
need to compute the true optimal solution. All algorithms
create capsules from scratch. We run all methods on the
same single Intel Xeon 2.66Ghz machine.
Tab. 3 shows the results. Our iterative algorithm achieves
87% of the optimal objective function value, a clear margin
better than naïve at 76%. On the toy dataset, solving capsule
wardrobes by brute force takes ∼ 2× our run-time. Run at
the realistic scale of our experiments above (N = 150) the
brute force solution is intractable, ∼1B hours per capsule,
but our solution takes only 200 sec.
Figure 7: Personalized capsules tailored for user preference.

Comparative human subject study How do humans perceive the gap between iterative and naïve greedy’s results? To analyze this, we perform a human perception
study with 14 subjects. Since the data contains women’s
clothing, all subjects are female, and they range in age from
20’s to 60’s. Using a Web form, we present 50 randomly
sampled pairs of capsules (iterative vs. na¨ıve greedy), dis-
played in random order per question. Then for each pair, the
subjects must select which is better. See Supp. for interface.
We use majority vote and weight by their confidence scores.
Despite the fact that naïve greedy leverages the same com-
patibility and versatility functions, 59% of the time the sub-
jects prefer our iterative algorithm’s capsules. This shows
the impact of the proposed optimization.
5. Conclusion
Computer vision can play an increasing role in fashion
applications, which are, after all, inherently visual. Our
work explores capsule wardrobe generation. The proposed
approach offers new insights in terms of both efficient opti-
mization for combinatorial mix-and-match outfit selection,
as well as generative learning of visual compatibility. Fur-
thermore, we demonstrate that compatibility learned in the
wild can successfully translate to clean product images via
attributes and a simple curriculum learning method. Future
work will explore ways to optimize attribute vocabularies
for capsules.
Acknowledgements: We thank Yu-Chuan Su and
Chao-Yuan Wu for helpful discussions. We also thank
our human subjects: Karen, Carol, Julie, Cindy, Thaoni,
Michelle, Ara, Jennifer, Ann, Yen, Chelsea, Mongchi, Sara,
Tiffany. This research is supported in part by an Amazon
Research Award and NSF IIS-1514118. We thank Texas
Advanced Computing Center for their generous support.